Current disfluency detection methods rely heavily on costly and scarce human-annotated data. To address this issue, some approaches use heuristic or statistical features to generate disfluent sentences, partially improving detection performance. However, these sentences often deviate from real-life scenarios, limiting further model improvement. In this study, we propose a lightweight data augmentation approach for disfluency detection that exploits the strong generative and semantic understanding capabilities of large language models (LLMs) to produce disfluent sentences as augmentation data. Guided by specific prompts, the LLM generates diverse and more realistic sentences without any fine-tuning. We then apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences, which are used to train a small detection model for improved performance. Experiments with the augmented data achieve state-of-the-art results, and show that even a small amount of LLM-generated data significantly improves performance, further enhancing cost-effectiveness. Our code is available here.
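The uncertainty-aware filtering step described above could be realized in several ways; a minimal sketch of one plausible variant is shown below, assuming a baseline detector exposes per-token label probabilities. Generated sentences on which the detector is highly uncertain (high mean token-level entropy) are discarded. The function names, the entropy criterion, and the threshold are illustrative assumptions, not the paper's exact method.

```python
import math


def token_entropy(probs):
    """Shannon entropy of one per-token label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_by_uncertainty(sentences, predict_probs, threshold=0.5):
    """Keep generated sentences whose mean token-level entropy,
    under a baseline detector, stays below `threshold`.

    `predict_probs(sentence)` is an assumed interface returning a list
    of per-token probability distributions over disfluency labels.
    """
    kept = []
    for sent in sentences:
        dists = predict_probs(sent)
        mean_ent = sum(token_entropy(d) for d in dists) / len(dists)
        if mean_ent < threshold:  # low uncertainty -> likely clean label
            kept.append(sent)
    return kept
```

A filtered pool produced this way would then serve as the training data for the small detection model; the threshold trades off data quantity against label reliability.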