Automatic evaluation metrics for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with text references. This differs from human language processing, in which visual imagination often improves comprehension. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for NLG. With the help of StableDiffusion, a state-of-the-art text-to-image generator, we automatically generate an image as the embodied imagination of a text snippet and compute the imagination similarity using contextual embeddings. Experiments spanning several text generation tasks demonstrate that incorporating machine-generated images through ImaginE shows great potential for introducing multi-modal information into NLG evaluation and improves the correlation of existing automatic metrics with human similarity judgments in both reference-based and reference-free evaluation scenarios.
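The similarity step described above can be illustrated with a minimal sketch. It assumes that each generated image has already been encoded into a fixed-length embedding vector, and it uses cosine similarity as the comparison function. Both are assumptions made for illustration: the abstract does not name the encoder or the similarity function, and the function name `cosine_similarity` and the toy vectors below are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy stand-ins for the embeddings of two generated images (assumed values,
# not outputs of any real encoder): one image imagined from the reference
# text and one imagined from the candidate text.
reference_image_embedding = [0.20, 0.70, 0.10]
candidate_image_embedding = [0.25, 0.60, 0.20]

score = cosine_similarity(reference_image_embedding, candidate_image_embedding)
print(f"imagination similarity (toy example): {score:.3f}")
```

In a real pipeline, the toy vectors would be replaced by embeddings of images produced by the text-to-image generator. How the resulting score is combined with an existing text-based metric is not specified here and is left open.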