The growing availability of health-related instructional videos creates new opportunities for clinical training, patient rehabilitation, and health education, yet existing retrieval systems remain largely single-turn: a user submits one query and receives one ranked list. This one-shot interaction is brittle in health scenarios, where information needs are often vague at first and become clinically meaningful only after follow-up constraints such as posture, hand placement, contraindications, equipment, or patient condition are specified. We introduce the task of interactive multi-turn semantic retrieval for health videos and construct MHVRC, a Multi-Turn Health Video Retrieval Corpus, by combining video-grounded descriptions from VideoChat-Flash with query refinements generated by DeepSeek. We further propose DATR, a Dialogue-Aware Two-Stage Retrieval framework. DATR first performs efficient coarse retrieval with a CLIP-style dual encoder and sparse frame sampling, then re-ranks the top candidates through multi-turn query fusion and a lightweight cross-encoder scoring module. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, while user studies indicate that refined multi-turn queries capture fine-grained procedural semantics better than single-turn annotations. This work establishes a benchmark and a scalable technical recipe for interactive health video retrieval.
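The two-stage flow described for DATR can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes embeddings have already been produced by some dual encoder, that sparsely sampled frame embeddings are mean-pooled into one video vector, that coarse retrieval uses the first dialogue turn, and that turns are fused by a recency-weighted average. The names `mean_pool`, `fuse_turns`, `coarse_retrieve`, `rerank`, `datr_sketch`, and the `decay` parameter are hypothetical, and the `scorer` argument is only a stand-in for a learned cross-encoder (cosine similarity is used by default so the example runs without a model).

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average the embeddings of the (assumed already sparsely sampled) frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def fuse_turns(turns: Sequence[Vector], decay: float = 0.5) -> List[float]:
    """Hypothetical fusion: weighted average where later turns weigh more."""
    n = len(turns)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    dim = len(turns[0])
    return [sum(w * t[d] for w, t in zip(weights, turns)) / total for d in range(dim)]


def coarse_retrieve(query: Vector, video_embs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: rank all videos by dual-encoder similarity and keep the top k."""
    ranked = sorted(video_embs, key=lambda vid: cosine(query, video_embs[vid]), reverse=True)
    return ranked[:k]


def rerank(candidates: Sequence[str], fused_query: Vector,
           video_embs: Dict[str, Vector],
           scorer: Callable[[Vector, Vector], float]) -> List[str]:
    """Stage 2: re-order only the shortlisted candidates with the finer scorer."""
    return sorted(candidates, key=lambda vid: scorer(fused_query, video_embs[vid]), reverse=True)


def datr_sketch(turns: Sequence[Vector], video_frames: Dict[str, Sequence[Vector]],
                k: int, scorer: Callable[[Vector, Vector], float] = cosine) -> List[str]:
    """Coarse retrieval on the first turn, then re-ranking with the fused dialogue."""
    video_embs = {vid: mean_pool(frames) for vid, frames in video_frames.items()}
    candidates = coarse_retrieve(turns[0], video_embs, k)
    return rerank(candidates, fuse_turns(turns), video_embs, scorer)


# Toy 2-D embeddings (made up for illustration, not drawn from MHVRC).
toy_videos = {
    "cpr_adult": [[1.0, 0.0], [0.8, 0.2]],
    "cpr_infant": [[0.6, 0.8], [0.4, 0.6]],
    "yoga": [[-1.0, 0.0]],
}
toy_turns = [[1.0, 0.2], [0.0, 1.0]]  # initial query, then a refinement
print(datr_sketch(toy_turns, toy_videos, k=2))
```

In this toy setup the first turn alone ranks `cpr_adult` above `cpr_infant`, while the fused two-turn query reverses that order within the shortlist, which is the behavior the re-ranking stage is meant to provide. The cheap first stage bounds how many candidates the more expensive scorer has to examine.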