This study explores a streamlined facial data collection method for conversational contexts, addressing the limitations of existing approaches that often require extensive datasets and prioritize technical metrics over user perception and experience. We systematically investigate which facial expression data are essential for reconstructing photorealistic avatars and how they can be captured efficiently. Our research employs a two-phase methodology to identify efficient facial data collection strategies and evaluate their effectiveness. In the first phase, we conduct facial data acquisition and evaluate reconstruction performance using utterance data and emotional data. In the second phase, we carry out a comprehensive user evaluation comparing three progressive conditions: utterance data only, utterance plus emotional data, and a control condition using extensive data. Findings from 24 participants engaged in simulated face-to-face conversations reveal that targeted utterance and emotional data achieve levels of perceived realism, naturalness, and telepresence comparable to the extensive data collection approach, while reducing training time and data usage. These results demonstrate that targeted data inputs can enable efficient avatar face reconstruction, offering practical guidelines for real-time applications such as AR/VR telepresence and highlighting the trade-off between data quantity and perceived quality.