This paper presents a new approach to form-filling that reformulates the task as multimodal natural language Question Answering (QA). The reformulation first translates the elements of a GUI form (text fields, buttons, icons, etc.) into natural language questions that capture each element's multimodal semantics. Once a form element (Question) is matched to the user utterance (Answer), the element is filled by a pre-trained extractive QA system. Because it leverages pre-trained QA models and requires no form-specific training, the approach is zero-shot. The paper also presents a way to further refine form-filling through multi-task training that incorporates a potentially large number of successive tasks. Finally, the paper introduces a multimodal natural language form-filling dataset, Multimodal Forms (mForms), as well as a multimodal extension of the popular ATIS dataset, to support future research and experimentation. Results show that the new approach not only maintains robust accuracy under sparse training conditions but also achieves a state-of-the-art F1 of 0.97 on ATIS with approximately 1/10th of the training data.
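The question-then-extract flow described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `FormElement` fields, the question templates in `element_to_question`, and the `fill_form` helper are assumptions made for illustration. In particular, `toy_qa` is only a regex stand-in so the example runs without downloads; the approach described above would use a pre-trained extractive QA model in its place and would derive questions from each element's multimodal semantics rather than from a text label alone.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FormElement:
    """A GUI form element (field names are hypothetical)."""
    name: str   # internal slot name, e.g. "from_city"
    label: str  # visible label or icon description


def element_to_question(element: FormElement) -> str:
    """Turn a form element into a question (the template is an assumption)."""
    return f"What is the {element.label.lower()}?"


# A QA function maps (question, context) to an answer span, or None.
QAFunction = Callable[[str, str], Optional[str]]


def fill_form(elements: List[FormElement], utterance: str,
              qa: QAFunction) -> Dict[str, str]:
    """Ask one question per element and keep only the answered slots."""
    filled: Dict[str, str] = {}
    for element in elements:
        answer = qa(element_to_question(element), utterance)
        if answer is not None:
            filled[element.name] = answer
    return filled


def toy_qa(question: str, context: str) -> Optional[str]:
    """Regex stand-in for a pre-trained extractive QA model (illustration only)."""
    patterns = {
        "departure city": r"\bfrom ([A-Z][a-z]+)",
        "arrival city": r"\bto ([A-Z][a-z]+)",
    }
    for key, pattern in patterns.items():
        if key in question:
            match = re.search(pattern, context)
            return match.group(1) if match else None
    return None  # no answer span found for this question


form = [
    FormElement("from_city", "Departure city"),
    FormElement("to_city", "Arrival city"),
    FormElement("airline", "Airline"),
]
result = fill_form(form, "I want to fly from Boston to Denver", toy_qa)
print(result)
```

Because `fill_form` takes the QA function as a parameter, the stand-in can be swapped for any callable with the same `(question, context)` signature without changing the form-side code, which mirrors the zero-shot claim: no form-specific training is involved in this sketch.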