Text-to-video retrieval in the operating room (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. Because the most safety-critical events may not follow the common structure, realizing its full potential requires handling implicit queries, which demand reasoning to identify the right video (e.g., "the step right before clipping"). Existing methods, however, rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), which group concurrent subject-action-object triplets under non-overlapping temporal intervals. Rather than performing cross-modal matching through paired encoders, OR3 performs imagination-based retrieval, in which an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises the imagined ActDTs based on their discrepancies with the top candidates, capturing procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
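To make the ActDT idea concrete, the following minimal sketch shows one way that timed subject-action-object triplets could be grouped under non-overlapping temporal intervals. It is an illustration under stated assumptions, not the OR3 implementation: the names `TimedTriplet`, `Interval`, and `build_actdt`, the boundary-splitting strategy, and the example events are all hypothetical and do not come from the paper or from MM-OR.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: (subject, action, object).
Triplet = Tuple[str, str, str]


@dataclass(frozen=True)
class TimedTriplet:
    """A triplet annotated with the time span (in seconds) during which it holds."""
    start: float
    end: float
    triplet: Triplet


@dataclass
class Interval:
    """One ActDT-style entry: a time span and the triplets concurrent within it."""
    start: float
    end: float
    triplets: List[Triplet]


def build_actdt(events: List[TimedTriplet]) -> List[Interval]:
    """Group concurrent triplets under non-overlapping intervals.

    Assumed strategy (not taken from the paper): cut the timeline at every
    start/end time, then collect the triplets active in each resulting slice.
    """
    bounds = sorted({t for e in events for t in (e.start, e.end)})
    intervals: List[Interval] = []
    for lo, hi in zip(bounds, bounds[1:]):
        # A triplet is active in [lo, hi] if its span fully covers the slice.
        active = [e.triplet for e in events if e.start <= lo and e.end >= hi]
        if not active:
            continue  # skip gaps where nothing is happening
        prev = intervals[-1] if intervals else None
        if prev is not None and prev.end == lo and prev.triplets == active:
            prev.end = hi  # extend the previous interval instead of duplicating it
        else:
            intervals.append(Interval(lo, hi, active))
    return intervals


# Illustrative events only; these are not taken from the benchmark.
events = [
    TimedTriplet(0, 10, ("nurse", "prepares", "instrument table")),
    TimedTriplet(5, 15, ("surgeon", "drills", "knee")),
    TimedTriplet(20, 25, ("surgeon", "sutures", "patient")),
]
actdt = build_actdt(events)
for iv in actdt:
    print(f"[{iv.start}, {iv.end}] {iv.triplets}")
```

Under this representation, a query such as "the step right before clipping" reduces to looking up the interval that precedes the one containing a clipping triplet, which is the kind of temporal reasoning a single global embedding does not expose.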