When studying political communication, combining information from text, audio, and video signals promises to reflect the richness of human communication more comprehensively than analyzing any single modality alone. However, the heterogeneity, connectedness, and interaction of such multimodal data are challenging to model. We argue that aligning the respective modalities can be an essential step toward fully exploiting the potential of multimodal data because it informs the model with human understanding. Exploring aligned modalities unlocks promising analytical leverage. First, it allows us to make the most of the information in the data, which, among other things, opens the door to higher-quality predictions. Second, it makes it possible to answer research questions that span multiple modalities through cross-modal queries. Finally, alignment addresses concerns about model interpretability. We illustrate the utility of this approach by analyzing how German MPs address members of the far-right AfD in their speeches and by predicting the tone of video advertising in the 2020 US presidential race. Our paper offers important insights for anyone seeking to analyze multimodal data effectively.