NLP tasks are typically defined extensionally through datasets containing example instantiations (e.g., pairs of image i and text t), but motivated intensionally through capabilities invoked in verbal descriptions of the task (e.g., "t is a description of i, for which the content of i needs to be recognised and understood"). We present Pento-DIARef, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the "Incremental Algorithm"), which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims). Our question then is whether the extensional description (the dataset) is sufficient for a neural model to pick up the underlying regularity and exhibit this capability given the simple task definition of producing expressions from visual inputs. We find that a model supported by a vision detection step and a targeted data generation scheme achieves an almost perfect BLEU@1 score and sentence accuracy, whereas simpler baselines do not.
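For readers unfamiliar with the Incremental Algorithm mentioned above, the following is a minimal sketch of its core loop, not the code used to build Pento-DIARef. It walks through attributes in a fixed preference order and keeps an attribute only if the target's value rules out at least one remaining distractor. The attribute names (`color`, `shape`, `position`), their order, the `verbalise` template, and the example pieces are all illustrative assumptions.

```python
from typing import Dict, List, Sequence

Piece = Dict[str, str]  # hypothetical piece representation: attribute name -> value


def incremental_algorithm(
    target: Piece,
    distractors: Sequence[Piece],
    preference_order: Sequence[str] = ("color", "shape", "position"),  # assumed order
) -> Dict[str, str]:
    """Select the attributes of `target` that distinguish it from `distractors`.

    Simplified variant: an attribute is kept only if it excludes at least one
    distractor that is still in play, and the loop stops once none remain.
    """
    remaining: List[Piece] = list(distractors)
    selected: Dict[str, str] = {}
    for attr in preference_order:
        if not remaining:
            break  # target is already uniquely identified
        value = target[attr]
        # Does mentioning this attribute rule out any remaining distractor?
        if any(d[attr] != value for d in remaining):
            selected[attr] = value
            remaining = [d for d in remaining if d[attr] == value]
    return selected


def verbalise(selected: Dict[str, str]) -> str:
    """Render the selected attributes with a hypothetical surface template."""
    words = ["Take the"]
    if "color" in selected:
        words.append(selected["color"])
    # Fall back to a generic head noun when the shape is not needed.
    words.append(selected.get("shape", "piece"))
    if "position" in selected:
        words.append("in the " + selected["position"])
    return " ".join(words)


# Illustrative scene (made-up pieces, not taken from the dataset).
example_target = {"color": "red", "shape": "T", "position": "top left"}
example_distractors = [
    {"color": "red", "shape": "X", "position": "top left"},
    {"color": "blue", "shape": "T", "position": "bottom right"},
]
example_selection = incremental_algorithm(example_target, example_distractors)
example_expression = verbalise(example_selection)
```

In this made-up scene the colour excludes the blue piece and the shape excludes the remaining red one, so the position is never mentioned. The sketch illustrates the kind of regularity the neural model is expected to recover from the examples alone.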