Compositional generalization benchmarks for semantic parsing seek to assess whether models can accurately compute meanings for novel sentences, but operationalize this in terms of logical form (LF) prediction. This raises the concern that semantically irrelevant details of the chosen LFs could shape model performance. We argue that this concern is realized for the COGS benchmark. COGS poses generalization splits that appear impossible for present-day models, which could be taken as an indictment of those models. However, we show that the negative results trace to incidental features of COGS LFs. Converting these LFs to semantically equivalent ones and factoring out capabilities unrelated to semantic interpretation, we find that even baseline models get traction. A recent variable-free translation of COGS LFs suggests similar conclusions, but we observe this format is not semantically equivalent; it is incapable of accurately representing some COGS meanings. These findings inform our proposal for ReCOGS, a modified version of COGS that comes closer to assessing the target semantic capabilities while remaining very challenging. Overall, our results reaffirm the importance of compositional generalization and careful benchmark task design.