Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modal content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficient content utilization. The sheer volume of long-form video and audio data (e.g., lengthy clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video-to-Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade-in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient end-to-end pipeline strategy balancing ALM- and VLM-enhanced processing. Evaluations on the Video-MME benchmark (900 videos) and our proprietary dataset of 16,159 pharmaceutical videos across 14 disease areas demonstrate a 3–4× speedup, a 4× cost reduction, and competitive clip quality. Beyond efficiency gains, our method also improves clip-coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, customizable, extractive, and compliance-supporting video summarization for the life sciences.
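The abstract does not spell out the Cut & Merge algorithm; as a rough illustration, the timestamp-normalization and merge step could be sketched as below. All function names and the `gap`/`fade` parameters are hypothetical choices for this sketch, not the paper's actual implementation.

```python
def normalize_and_merge(segments, duration, gap=0.5):
    """Clamp segment timestamps to [0, duration], sort them, and merge
    segments that overlap or sit within `gap` seconds of each other,
    so the resulting clip boundaries are monotone and non-overlapping."""
    clamped = sorted(
        (max(0.0, s), min(duration, e))
        for s, e in segments
        if min(duration, e) > max(0.0, s)  # drop empty/out-of-range spans
    )
    merged = []
    for s, e in clamped:
        if merged and s - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], e)  # extend previous clip
        else:
            merged.append([s, e])
    return [tuple(m) for m in merged]


def with_fades(clips, fade=0.3):
    """Attach a fade-in/out duration to each clip, shortened for very
    short clips so the fades never overlap."""
    return [
        {"start": s, "end": e, "fade": min(fade, (e - s) / 2)}
        for s, e in clips
    ]
```

Applying actual fades and concatenation would then be delegated to a video toolchain (e.g., FFmpeg's `fade`/`afade` filters), with these normalized boundaries as input.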