Bridging Short Videos and Live Streams: Reasoning-Guided Multimodal LLMs for Cross-Domain Representation Learning

As live streaming services grow, many platforms now offer both short videos and live streams to meet diverse user needs. Short videos carry substantial traffic and rich behavioral signals, whereas live streaming, although a core conversion scenario, has sparse behavior data and therefore suffers from a severe cold-start problem. Transferring user interests from short videos to live streaming recommendation can alleviate this problem. Short videos and live streams are also complex multimodal items, and integrating multimodal signals improves recommendation performance. Although Multimodal Large Language Models (MLLMs) show strong multimodal understanding and reasoning, their application to cross-domain recommendation remains underexplored. To this end, we propose Reasoning-Guided Cross-Domain Representation Learning (RGCD-Rep), a framework for cross-domain recommendation from short videos to live streams. RGCD-Rep incorporates MLLM reasoning in a resource-efficient way and, through two-stage training, learns transferable item representations guided by collaborative behavior signals. In the first stage, reasoning-aware distillation has a frozen teacher MLLM generate structured cross-domain reasoning knowledge, which is then distilled into a lightweight student MLLM. In the second stage, transferability-guided cross-domain representation learning decomposes each item representation into a transferable representation and a domain-residual representation. The resulting representations are computed offline and integrated into downstream retrieval tasks, enabling low-cost industrial deployment. Extensive offline experiments demonstrate the superiority of RGCD-Rep. After deployment in Kuaishou's live streaming recommendation system, A/B tests show significant gains across multiple core business metrics, confirming its effectiveness and practicality in real industrial scenarios. RGCD-Rep is fully deployed and serves over 400 million users daily.

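
To make the second stage more concrete, the following is a minimal illustrative sketch of splitting an item representation into a transferable part and a domain-residual part. It is not the RGCD-Rep implementation: the abstract does not specify how the decomposition is learned, so this sketch assumes a fixed orthonormal basis (`basis`) standing in for a learned transferable subspace, and the names `decompose`, `dot`, and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def decompose(item_vec, basis):
    """Split item_vec into a transferable part and a domain residual.

    ASSUMPTION: `basis` is a list of orthonormal vectors spanning a
    hypothetical "transferable" subspace. In a real system this would be
    learned; here it is fixed purely for illustration.
    """
    transferable = [0.0] * len(item_vec)
    for b in basis:
        coeff = dot(item_vec, b)  # projection coefficient onto b
        transferable = [t + coeff * y for t, y in zip(transferable, b)]
    # Whatever the transferable subspace cannot explain is the residual.
    residual = [x - t for x, t in zip(item_vec, transferable)]
    return transferable, residual


# Toy 4-dimensional item embeddings (made-up numbers, not real data).
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
short_video = [0.9, 0.1, 0.5, -0.2]
live_stream = [0.8, 0.2, -0.4, 0.7]

sv_transfer, sv_residual = decompose(short_video, basis)
ls_transfer, ls_residual = decompose(live_stream, basis)

# Cross-domain similarity uses only the transferable parts, so the
# domain-specific residuals do not influence the score.
cross_domain_score = dot(sv_transfer, ls_transfer)
```

Because the two parts sum back to the original vector, such representations could be computed once offline and stored, with only the transferable part used for cross-domain retrieval. That mirrors the offline computation described in the abstract, under the assumptions stated above.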