Recent work has shown that generation from a prompted or fine-tuned language model can perform well at semantic parsing when the output is constrained to be a valid semantic representation. We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing, which includes context-free grammars for seven semantic parsing datasets and two syntactic parsing datasets with varied output representations, as well as a constrained decoding interface that generates only valid outputs covered by these grammars. We provide low-, medium-, and high-resource splits for each dataset, allowing accurate comparison of language models under different data regimes. Our benchmark supports evaluation of language models using prompt-based learning as well as fine-tuning. We benchmark eight language models, including two GPT-3 variants available only through an API. Our experiments show that encoder-decoder pretrained language models can achieve performance similar to or better than state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
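To make the idea of grammar-constrained decoding concrete, the following is a minimal sketch and not BenchCLAMP's actual interface. It assumes a toy grammar `E -> FUNC "(" E ")" | "x"`, a six-token vocabulary, and a hypothetical scoring function `toy_score` that stands in for a language model. At each step, greedy decoding is restricted to the tokens that keep the output a valid prefix under the grammar.

```python
from typing import Callable, List, Set

# Toy vocabulary and grammar (assumptions for illustration only):
#   E -> FUNC "(" E ")" | "x"      with FUNC in {"count", "filter"}
FUNCS = {"count", "filter"}
EOS = "<eos>"
VOCAB = ["(", ")", "x", "count", "filter", EOS]


def allowed_next(prefix: List[str]) -> Set[str]:
    """Return the tokens that keep `prefix` a valid prefix under the toy grammar."""
    depth = 0          # number of currently open parentheses
    state = "expr"     # "expr": need an expression, "open": need "(", "close": need ")" or EOS
    for tok in prefix:
        if state == "expr" and tok in FUNCS:
            state = "open"
        elif state == "expr" and tok == "x":
            state = "close"
        elif state == "open" and tok == "(":
            depth += 1
            state = "expr"
        elif state == "close" and tok == ")" and depth > 0:
            depth -= 1
        else:
            raise ValueError(f"invalid prefix at token {tok!r}")
    if state == "expr":
        return FUNCS | {"x"}
    if state == "open":
        return {"("}
    return {")"} if depth > 0 else {EOS}


def toy_score(prefix: List[str], token: str) -> float:
    """Hypothetical stand-in for a language model's next-token score."""
    base = {")": 3.0, "count": 2.5, "x": 2.0, "(": 1.0, "filter": 0.5, EOS: 0.0}
    penalty = 2.0 if token in prefix else 0.0  # discourage repeated tokens
    return base[token] - penalty


def constrained_greedy(score: Callable[[List[str], str], float],
                       max_len: int = 20) -> List[str]:
    """Greedy decoding in which only grammar-valid tokens are considered."""
    prefix: List[str] = []
    while len(prefix) < max_len:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(allowed_next(prefix)), key=lambda t: score(prefix, t))
        if best == EOS:
            return prefix
        prefix.append(best)
    raise RuntimeError("no complete parse within max_len tokens")


result = constrained_greedy(toy_score)
print(" ".join(result))  # count ( x )
```

Without the constraint, greedy decoding under `toy_score` would begin with `)`, which is not a valid start under the toy grammar; with the constraint, every emitted sequence is covered by the grammar by construction. A grammar of realistic size would replace the hand-written state machine in `allowed_next` with an incremental parser.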