Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground-truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing the dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow that automatically generates emotion-related synthetic ground truth from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations that pair audio samples with their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while avoiding the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on their similarity to the query in the acoustic feature space.
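The retrieval-based selection step described above can be illustrated with a minimal sketch. The abstract does not specify the acoustic features, the similarity measure, the number of demonstrations, or the prompt layout, so everything below is an assumption made for illustration: cosine similarity over fixed-length feature vectors, a top-k cutoff, a text-only prompt built from transcriptions, and the hypothetical names `Demo`, `select_demonstrations`, and `build_prompt`.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demo:
    """One labeled demonstration (hypothetical structure, not from the paper)."""
    features: Sequence[float]  # assumed fixed-length acoustic feature vector
    transcript: str            # transcription paired with the audio sample
    label: str                 # assumed emotion-related label


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_demonstrations(query: Sequence[float],
                          pool: List[Demo],
                          k: int = 2) -> List[Demo]:
    """Return the k pool entries most similar to the query, most similar first."""
    ranked = sorted(pool,
                    key=lambda d: cosine_similarity(query, d.features),
                    reverse=True)
    return ranked[:k]


def build_prompt(demos: List[Demo], query_transcript: str) -> str:
    """Assemble a few-shot prompt (assumed layout) ending at the unlabeled query."""
    lines = []
    for d in demos:
        lines.append(f"Utterance: {d.transcript}")
        lines.append(f"Label: {d.label}")
        lines.append("")
    lines.append(f"Utterance: {query_transcript}")
    lines.append("Label:")
    return "\n".join(lines)
```

In this sketch, a new utterance would first be mapped to a feature vector, `select_demonstrations` would pick its nearest neighbors from the labeled pool, and `build_prompt` would place them ahead of the query so that the model completes the final label. A workflow that passes audio directly to the model would replace the text-only prompt with paired audio and transcription inputs.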