Despite significant progress in Natural Language Generation for Indian languages (IndicNLP), datasets for complex structured tasks such as semantic parsing are still lacking. One reason for this gap is the complexity of the logical form, which makes translation from English into multiple languages difficult: the logical forms, intents, and slots must be aligned with the translated unstructured utterance. To address this, we propose IE-SEMPARSE, an Inter-bilingual Seq2seq Semantic parsing dataset for 11 distinct Indian languages. We highlight the practicality of the proposed task and evaluate existing multilingual seq2seq models across several train-test strategies. Our experiments reveal a high correlation between performance on the original multilingual semantic parsing datasets (such as mTOP, multilingual TOP, and multiATIS++) and performance on our proposed IE-SEMPARSE suite.