Live streaming platforms require real-time monitoring of, and reaction to, social signals, drawing on partial and asynchronously arriving evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight encoder, escalates hard or ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained with (i) a cross-modal contrastive term that aligns visual and audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate-content moderation); the results show that StreamSense achieves higher accuracy than VLM-only streaming while invoking the VLM only occasionally, thereby reducing average latency and compute. These results indicate that selective escalation and deferral are effective primitives for streaming social understanding tasks. Code is publicly available on GitHub.
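The three-way decision described above (answer with the lightweight encoder, escalate to the VLM, or defer) can be sketched as a simple confidence-based router. This is an illustrative assumption, not the paper's implementation: the thresholds, the `min_context` requirement, and the function name `route` are all hypothetical.

```python
def route(confidence: float, context_len: int,
          tau_defer: float = 0.3, tau_escalate: float = 0.6,
          min_context: int = 8) -> str:
    """Decide how to handle one timestamp given the light encoder's confidence.

    Hypothetical routing rule: defer when context is too short or confidence
    is very low, answer directly when confidence is high, escalate otherwise.
    """
    if context_len < min_context:
        return "defer"      # not enough streaming context yet: postpone
    if confidence >= tau_escalate:
        return "light"      # lightweight streaming encoder answers directly
    if confidence >= tau_defer:
        return "vlm"        # ambiguous case: escalate to the VLM expert
    return "defer"          # very low confidence: wait for more evidence

print(route(0.9, 32))   # confident, enough context
print(route(0.45, 32))  # ambiguous: escalated
print(route(0.9, 2))    # too little context: deferred
```

Under this sketch, average cost stays low because the VLM is consulted only in the middle confidence band, matching the abstract's claim that the expert is invoked only occasionally.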
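The IoU-weighted loss can be illustrated with a minimal sketch: each target segment's loss is scaled by its temporal IoU with the corresponding prediction window, so poorly overlapping targets contribute little. The helper names and the normalized weighted average below are assumptions for illustration, not the authors' exact formulation.

```python
def temporal_iou(a, b):
    """IoU of two 1-D intervals a = (start, end), b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def iou_weighted_loss(per_segment_losses, windows, targets):
    """Hypothetical aggregate: IoU-weighted average of per-segment losses."""
    weights = [temporal_iou(w, t) for w, t in zip(windows, targets)]
    total_w = sum(weights)
    if total_w == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, per_segment_losses)) / total_w

# A well-aligned target dominates; a barely overlapping one is down-weighted,
# which mitigates label interference across segment boundaries.
losses  = [0.2, 2.0]
windows = [(0.0, 2.0), (0.0, 2.0)]
targets = [(0.0, 2.0), (1.9, 4.0)]  # IoUs: 1.0 and 0.025
print(iou_weighted_loss(losses, windows, targets))
```

The second target overlaps the window by only 0.1 units, so its large loss barely moves the aggregate, which stays close to the well-aligned segment's loss of 0.2.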