In Natural Language Generation (NLG) tasks, multiple communicative goals are plausible for any input, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric (data) uncertainty. We then inspect the space of output strings shaped by a generation system's predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator's calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references provides the level of detail necessary to understand a model's representation of uncertainty. Code is available at https://github.com/dmg-illc/nlg-uncertainty-probes.
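To make the instance-level idea concrete, the following is a minimal sketch of how one might compare the variability of a generator's samples with the variability of human references for a single input. The abstract does not specify the metrics used, so everything here is an assumption for illustration: the Jaccard-based `unigram_distance`, the helper names `pairwise_variability` and `variability_gap`, and the toy strings are all hypothetical and are not taken from the authors' repository.

```python
from itertools import combinations
from statistics import mean


def unigram_distance(a: str, b: str) -> float:
    """Jaccard distance between the word sets of two strings.

    Assumed stand-in for a lexical distance: 0.0 means identical
    vocabulary, 1.0 means no words in common.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def pairwise_variability(outputs: list[str]) -> list[float]:
    """Distances between every unordered pair of outputs for one input."""
    return [unigram_distance(a, b) for a, b in combinations(outputs, 2)]


def variability_gap(references: list[str], samples: list[str]) -> float:
    """Mean pairwise distance among samples minus that among references.

    Negative values suggest the generator varies less than humans do on
    this input; positive values suggest it varies more. Each list needs
    at least two strings so that at least one pair exists.
    """
    return mean(pairwise_variability(samples)) - mean(pairwise_variability(references))


# Toy data for a single test input (hypothetical).
references = [
    "a dog runs on the beach",
    "a dog is running along the shore",
    "the puppy plays in the sand",
]
samples = [
    "a dog runs on the beach",
    "a dog runs on the beach",
    "a dog runs on a beach",
]

gap = variability_gap(references, samples)
print(f"variability gap for this input: {gap:.3f}")
```

Here the samples are near-duplicates while the references differ widely, so the gap is negative. Computing such a quantity per input, rather than averaging over a corpus first, is what keeps the analysis at the instance level; a fuller treatment would likely compare whole distance distributions and add syntactic and semantic measures alongside the lexical one.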