Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads

Video-based ads are a vital medium for brands to engage consumers, with social media platforms leveraging user data to optimize ad delivery and boost engagement. A crucial but under-explored aspect is the 'hooking period', the first three seconds that capture viewer attention and influence engagement metrics. Analyzing this brief window is challenging due to the multimodal nature of video content, which blends visual, auditory, and textual elements. Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation. This study presents a framework using transformer-based multimodal large language models (MLLMs) to analyze the hooking period of video ads. It tests two frame sampling strategies, uniform random sampling and key frame selection, to ensure balanced and representative acoustic feature extraction, capturing the full range of design elements. The hooking video is processed by state-of-the-art MLLMs to generate descriptive analyses of the ad's initial impact, which are distilled into coherent topics using BERTopic for high-level abstraction. The framework also integrates features such as audio attributes and aggregated ad targeting information, enriching the feature set for further analysis. Empirical validation on large-scale real-world data from social media platforms demonstrates the efficacy of our framework, revealing correlations between hooking period features and key performance metrics like conversion per investment. The results highlight the practical applicability and predictive power of the approach, offering valuable insights for optimizing video ad strategies. This study advances video ad analysis by providing a scalable methodology for understanding and enhancing the initial moments of video advertisements.
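
To make the two frame sampling strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes frames are already decoded into flat lists of grayscale intensities. The function names, the fixed seed, and the use of mean absolute inter-frame difference as a key frame criterion are all illustrative assumptions; the abstract does not specify how key frames are selected.

```python
from __future__ import annotations

import random

HOOK_SECONDS = 3.0  # the abstract defines the hooking period as the first three seconds


def hook_frame_count(fps: float, total_frames: int) -> int:
    """Number of frames that fall inside the hooking period."""
    return min(total_frames, int(round(fps * HOOK_SECONDS)))


def uniform_random_sample(n_frames: int, k: int, seed: int = 0) -> list[int]:
    """Draw k distinct frame indices uniformly at random, in temporal order."""
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility only
    k = min(k, n_frames)
    return sorted(rng.sample(range(n_frames), k))


def key_frame_sample(frames: list[list[float]], k: int) -> list[int]:
    """Pick frame 0 plus the k-1 frames that differ most from their predecessor.

    Assumption: mean absolute pixel difference stands in for whatever
    key frame detector the framework actually uses.
    """
    if not frames or k <= 0:
        return []
    diffs = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        score = sum(abs(a - b) for a, b in zip(prev, cur)) / max(len(cur), 1)
        diffs.append((score, i))
    # Largest change first; ties are broken in favor of the earlier frame.
    diffs.sort(key=lambda t: (-t[0], t[1]))
    chosen = {0} | {i for _, i in diffs[: k - 1]}
    return sorted(chosen)
```

For a 30 fps video, `hook_frame_count(30, total_frames)` bounds the hooking period at 90 frames. Either sampler then returns the indices of the frames that would be passed to the MLLM.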
