Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups

Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. An automatic conversational agent could improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open-source LLM to classify posts from our existing smoking cessation support groups and to identify intents with low F1 scores (F1 being the harmonic mean of precision and recall). Then, for these intents, we generated additional synthetic data using prompt engineering with a GPT model; on average, 87% of the generated synthetic posts were deemed high quality by human annotators. Overall, 43% of the original posts were selected for synthetic augmentation, and these posts were then expanded synthetically by 140%. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post was validated by human reviewers for quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.

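The first step of the pipeline, finding intents whose F1 is low enough to warrant augmentation, can be illustrated with a minimal sketch. It is not the authors' implementation: the intent names, the toy labels, and the 0.7 threshold are all hypothetical, and the abstract does not state which cutoff was used.

```python
from collections import Counter


def per_intent_f1(gold, predicted):
    """Return {intent: F1} from parallel lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted intent gains a false positive
            fn[g] += 1  # gold intent gains a false negative
    scores = {}
    for intent in set(gold) | set(predicted):
        p_den = tp[intent] + fp[intent]
        r_den = tp[intent] + fn[intent]
        precision = tp[intent] / p_den if p_den else 0.0
        recall = tp[intent] / r_den if r_den else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[intent] = 2 * precision * recall / denom if denom else 0.0
    return scores


def intents_to_augment(scores, threshold=0.7):
    """Return intents whose F1 falls below the (assumed) threshold."""
    return sorted(i for i, f1 in scores.items() if f1 < threshold)


# Hypothetical intent labels and classifier outputs, for illustration only.
gold = ["craving", "craving", "support", "support", "milestone", "milestone"]
pred = ["craving", "support", "support", "support", "milestone", "milestone"]

scores = per_intent_f1(gold, pred)
low_f1_intents = intents_to_augment(scores, threshold=0.7)
print(low_f1_intents)
```

In this toy data only the `craving` intent falls below the assumed threshold, so it alone would be passed on to synthetic generation and human validation.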