This work focuses on in-context data augmentation for intent detection. Having found that augmentation via in-context prompting of large pre-trained language models (PLMs) alone does not improve performance, we introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that measures how useful a datapoint is for training a model. Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints, that is, utterances that correspond to given intents. It then applies intent-aware filtering, based on PVI, to remove datapoints that are not helpful to the downstream intent classifier. Our method thus leverages the expressive power of large language models to produce diverse training data. Empirical results demonstrate that our method produces synthetic training data that achieve state-of-the-art performance on three challenging intent detection datasets under few-shot settings (1.28% absolute improvement in 5-shot and 1.18% absolute improvement in 10-shot, on average) and perform on par with the state of the art in full-shot settings (within 0.01% absolute, on average).
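As an illustration of the filtering step, the following minimal sketch assumes the standard definition of PVI as the log-ratio, in bits, between the probability a classifier fine-tuned with the input assigns to the gold label and the probability a classifier fine-tuned on a null input assigns to the same label. The function names, the toy probabilities, and the per-intent threshold values are hypothetical and are not taken from our implementation; in practice the probabilities would come from the two fine-tuned models.

```python
import math


def pvi(p_with_input, p_null_input):
    """PVI in bits: log2 of the label probability given the input,
    minus log2 of the label probability given a null input."""
    return math.log2(p_with_input) - math.log2(p_null_input)


def filter_by_intent(candidates, thresholds):
    """Keep synthetic utterances whose PVI reaches the threshold of their intent.

    candidates: iterable of (utterance, intent, p_with_input, p_null_input)
    thresholds: dict mapping intent -> minimum PVI (hypothetical values)
    """
    kept = []
    for utterance, intent, p_with, p_null in candidates:
        if pvi(p_with, p_null) >= thresholds[intent]:
            kept.append((utterance, intent))
    return kept


# Toy probabilities standing in for model outputs (assumed, not measured).
candidates = [
    ("what's my balance", "check_balance", 0.8, 0.1),   # PVI is about 3 bits
    ("tell me a joke", "check_balance", 0.05, 0.1),     # PVI is -1 bit
    ("send money to bob", "transfer", 0.5, 0.25),       # PVI is 1 bit
]
thresholds = {"check_balance": 1.0, "transfer": 2.0}

kept = filter_by_intent(candidates, thresholds)
print(kept)
```

Using one threshold per intent, rather than a single global cutoff, is what makes the filter intent-aware in this sketch: an utterance is judged against the bar set for its own class.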