Current disfluency detection methods rely heavily on human-annotated data, which is costly and scarce. To address this, some approaches use heuristic or statistical features to generate disfluent sentences, which partially improves detection performance. However, these sentences often deviate from real-life usage, which limits the overall gain for the model. In this study, we propose a lightweight data augmentation approach for disfluency detection that uses the generative and semantic understanding capabilities of large language models (LLMs) to produce disfluent sentences as augmentation data. Guided by specific prompts, the LLM generates diverse and more realistic sentences without any fine-tuning. We then apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences, and use the filtered sentences to train a small detection model. Experiments with the augmented data yield state-of-the-art results. The results also show that a small amount of LLM-generated augmented data can significantly improve performance, which makes the approach more cost-effective.
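The abstract does not say how uncertainty is measured or which side of a threshold is kept, so the following is only a minimal sketch of one plausible form of uncertainty-aware filtering. It assumes a detector that outputs a label distribution per token, scores each generated sentence by mean token entropy, and keeps sentences at or below a threshold. The function names, the `token_probs` field, and the threshold value are all hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy; returns 0.0 for an empty sentence."""
    if not token_probs:
        return 0.0
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def filter_by_uncertainty(candidates: List[Dict], threshold: float) -> List[Dict]:
    """Keep candidates whose mean token entropy is at or below `threshold`.

    Assumption: each candidate is a dict with a hypothetical "token_probs"
    entry holding one [p_fluent, p_disfluent] pair per token, produced by
    some detector that is not shown here.
    """
    return [
        c for c in candidates
        if sentence_uncertainty(c["token_probs"]) <= threshold
    ]


# Two made-up LLM-generated sentences with illustrative detector outputs.
generated = [
    {"text": "I want a flight to Boston uh to Denver",
     "token_probs": [[0.99, 0.01]] * 9},   # detector is confident
    {"text": "I mean the the thing is",
     "token_probs": [[0.5, 0.5]] * 6},     # detector is maximally unsure
]
kept = filter_by_uncertainty(generated, threshold=0.3)
```

With the illustrative numbers above, only the first sentence passes the filter. Whether low-uncertainty or high-uncertainty samples are the more useful ones to keep is a design choice the abstract leaves open; flipping the comparison in `filter_by_uncertainty` gives the opposite policy.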