Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess parser evaluation by pairing graph matching with automated theorem proving. We compare two approaches to building parsers, supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under both normalized and unnormalized target representations. We evaluate outputs with graph matching, with bidirectional entailment between source and target formulas checked by a first-order logic theorem prover, and with well-formedness checks. Across settings, we find that models that score well on graph matching often fail to produce logically equivalent formulas. Normalization reduces incidental variability in the targets, improves well-formedness, and strengthens logical adequacy. Error analysis shows that performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, as well as predicate naming. These findings expose the limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives, together with simplified, normalized target representations. All code and data for our experiments are publicly available.
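To make the logical-adequacy criterion concrete, the sketch below shows one possible implementation of the bidirectional entailment check with an off-the-shelf first-order prover. It is an illustrative assumption, not the released code: the library (NLTK's ResolutionProver), the helper names, and the example formulas are hypothetical stand-ins for whatever prover and formula format the experiments actually use.

```python
# Minimal sketch (assumed, not the paper's released code) of the bidirectional
# entailment check: two first-order formulas count as logically equivalent
# iff each entails the other under an automated theorem prover.
from nltk.sem import Expression
from nltk.inference import ResolutionProver

read = Expression.fromstring
prover = ResolutionProver()

def entails(premise, conclusion):
    # True iff `premise |- conclusion` according to the resolution prover.
    return prover.prove(conclusion, [premise])

def logically_equivalent(gold_str, pred_str):
    # Bidirectional entailment between a gold formula and a predicted one.
    gold, pred = read(gold_str), read(pred_str)
    return entails(gold, pred) and entails(pred, gold)

# Hypothetical example: the two formulas differ only in bound-variable names,
# so a logic-sensitive check accepts the prediction even though its surface
# form differs from the gold target.
gold = r'exists x.(dog(x) & bark(x))'
pred = r'exists y.(dog(y) & bark(y))'
print(logically_equivalent(gold, pred))  # True
```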