Audiovisual (AV) archives are invaluable for holistically preserving the past. Compared with other forms of archival material, however, they can be difficult to explore. This is due not only to their complex modality and sheer volume but also to the lack of appropriate interfaces beyond keyword search. The recent rise of text-to-video retrieval tasks in computer science opens the way to more natural, semantic access to AV content by mapping natural-language descriptions to matching videos. However, applications of such models are rarely seen in practice. The contribution of this work is threefold. First, working with RTS (Télévision Suisse Romande), we identified the key blockers to implementing such models in a real archive and built a functioning pipeline for encoding raw archive videos into text-to-video feature vectors. Second, we designed and verified a method to encode and retrieve videos using emotionally rich descriptions that the original model does not support. Third, we proposed an initial prototype for immersive and interactive exploration of AV archives in a latent space built on these video encodings.