Short-form video platforms integrate text, visuals, and audio into complex communicative acts, yet existing research analyzes these modalities in isolation, lacking scalable frameworks to interpret their joint contributions. This study introduces a pipeline combining automated multimodal feature extraction with Shapley value-based interpretability to analyze how text, visuals, and audio jointly influence engagement. Applying this framework to 162,965 TikTok videos and 814,825 images about social anxiety disorder (SAD), we find that facial expressions outperform textual sentiment in predicting viewership, informational content drives more attention than emotional support, and cross-modal synergies exhibit threshold-dependent effects. These findings demonstrate how multimodal analysis reveals interaction patterns invisible to single-modality approaches. Methodologically, we contribute a reproducible framework for interpretable multimodal research applicable across domains; substantively, we advance understanding of mental health communication in algorithmically mediated environments.
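The Shapley value attribution at the heart of the pipeline can be sketched in miniature. The coalition values below are purely illustrative placeholders (not figures from the study): each entry stands for a model's predicted engagement when it sees only that subset of modalities, with the `text`+`visual` coalition made super-additive to mimic the cross-modal synergies the abstract describes.

```python
from itertools import permutations

# Hypothetical coalition values: predicted engagement (arbitrary units) when a
# model has access only to the listed modalities. Illustrative numbers only.
V = {
    frozenset(): 0.0,
    frozenset({"text"}): 0.20,
    frozenset({"visual"}): 0.35,
    frozenset({"audio"}): 0.10,
    frozenset({"text", "visual"}): 0.65,   # super-additive: cross-modal synergy
    frozenset({"text", "audio"}): 0.32,
    frozenset({"visual", "audio"}): 0.50,
    frozenset({"text", "visual", "audio"}): 0.80,
}

def shapley_values(players, v):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the coalition can be assembled."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}

phi = shapley_values(["text", "visual", "audio"], V)

# Efficiency property: the attributions sum to the grand-coalition value.
assert abs(sum(phi.values()) - V[frozenset({"text", "visual", "audio"})]) < 1e-9
print(phi)
```

With three modalities the 3! = 6 orderings can be enumerated exactly; at the feature granularity used in practice, libraries such as `shap` approximate the same quantity by sampling. Under these toy values the visual modality receives the largest attribution, echoing the finding that facial expressions outpredict textual sentiment.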