Sociocultural norms serve as guiding principles for personal conduct in social interactions, emphasizing respect, cooperation, and appropriate behavior, and can benefit tasks including conversational information retrieval, contextual information retrieval, and retrieval-enhanced machine learning. We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs) for socially aware dialogues. We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase. Our approach uses socially aware dialogues, enriched with contextual frames, as the primary data source to constrain the generation process and reduce hallucinations. This enables the extraction of high-quality and nuanced natural-language norm statements, leveraging the pragmatic implications of utterances with respect to the situation. As real dialogues annotated with gold frames are not readily available, we propose using synthetic data. Our empirical results show that (i) the quality of the SCNs derived from synthetic data is comparable to that of SCNs derived from real dialogues annotated with gold frames, and (ii) the quality of the SCNs extracted from real data annotated with either silver (predicted) or gold frames surpasses that of SCNs extracted without frame annotations. We further show the effectiveness of the extracted SCNs in a Retrieval-Augmented Generation (RAG) model that reasons about multiple downstream dialogue tasks.
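To make the two ideas in the abstract concrete (frame-constrained norm extraction and retrieval of norms for a RAG-style model), the following is a minimal sketch under stated assumptions, not the authors' implementation. The `Frame` fields, the prompt wording, the example norms, and the word-overlap retriever are all hypothetical stand-ins chosen for illustration; a real system would call an LLM with the prompt and would likely use a stronger retriever.

```python
import re
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical contextual frame; the field names are assumptions."""
    topic: str
    setting: str
    relationship: str


def build_extraction_prompt(dialogue: list[str], frame: Frame) -> str:
    """Build an LLM prompt that constrains norm extraction with the frame."""
    turns = "\n".join(dialogue)
    return (
        "Extract sociocultural norms implied by the dialogue.\n"
        f"Topic: {frame.topic}\n"
        f"Setting: {frame.setting}\n"
        f"Relationship: {frame.relationship}\n"
        f"Dialogue:\n{turns}\n"
        "Only state norms supported by the dialogue and the frame."
    )


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve_norms(query: str, norm_base: list[str], k: int = 2) -> list[str]:
    """Return the k norms sharing the most word tokens with the query.

    Word overlap is a toy stand-in for the retriever of a RAG pipeline.
    """
    q = tokenize(query)
    ranked = sorted(norm_base, key=lambda n: len(q & tokenize(n)), reverse=True)
    return ranked[:k]


# Illustrative norm statements (invented for this sketch, not from the NormBase).
NORM_BASE = [
    "Use honorific titles when addressing an elder at a family dinner.",
    "Decline a gift politely once before accepting it.",
    "Let the host propose the first toast at a banquet.",
]

if __name__ == "__main__":
    frame = Frame(topic="greeting", setting="family dinner",
                  relationship="grandchild and grandparent")
    print(build_extraction_prompt(["A: Grandpa, please sit first."], frame))
    print(retrieve_norms("How should I address an elder at dinner?", NORM_BASE))
```

In a RAG setting, the retrieved norm statements would be prepended to the downstream task prompt so that the model can reason with them; the frame fields in the extraction prompt play the constraining role that the abstract attributes to contextual frames.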