Abstract Meaning Representation (AMR) is a semantic formalism that captures the core meaning of an utterance. Substantial work has gone into developing AMR corpora in English and, more recently, across languages, but existing datasets remain limited in size and collecting more annotations is prohibitively costly. With both engineering and scientific questions in mind, we introduce MASSIVE-AMR, a dataset with more than 84,000 text-to-graph annotations, currently the largest and most diverse of its kind: AMR graphs for 1,685 information-seeking utterances mapped to 50+ typologically diverse languages. We describe how we built our resource and its unique features, then report on experiments using large language models for multilingual AMR and SPARQL parsing, as well as on applying AMRs for hallucination detection in the context of knowledge base question answering. The results shed light on persistent issues with using LLMs for structured parsing.