In this paper, we present WISE, an open-source audiovisual search engine that integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. "empty street") and the object level (e.g. "horse") across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using either text (e.g. "wood creak") or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Queries can be combined across modalities to obtain richer results: for example, retrieving German trains from a historical archive by applying the object query "train" together with the metadata query "Germany", or searching for a face in a particular place. By employing vector search techniques, WISE scales to efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open source and available at https://gitlab.com/vgg/wise/wise.
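
To make the combination of an object query with a metadata query more concrete, the following is a minimal sketch of the idea, not WISE's actual implementation or API. It assumes that each item has already been embedded as a vector by some model, and it ranks items by cosine similarity after applying an exact-match metadata filter. The names `search`, `INDEX`, and `TRAIN_QUERY`, the toy three-dimensional vectors, and the `country` field are all hypothetical. The sketch scans every item exhaustively; a system operating at the scale of millions of images would presumably rely on an approximate nearest-neighbour index instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, metadata_filter=None, top_k=3):
    """Return ids of the top_k items most similar to query_vec.

    Only items whose metadata matches every key/value pair in
    metadata_filter are considered (hypothetical exact-match filter).
    """
    metadata_filter = metadata_filter or {}
    hits = []
    for item in index:
        if all(item["metadata"].get(k) == v for k, v in metadata_filter.items()):
            hits.append((cosine(query_vec, item["vector"]), item["id"]))
    # Highest similarity first.
    hits.sort(key=lambda h: h[0], reverse=True)
    return [item_id for _, item_id in hits[:top_k]]


# Toy index: the vectors are hand-written stand-ins for model embeddings.
INDEX = [
    {"id": "train_de.jpg", "vector": [0.9, 0.1, 0.0], "metadata": {"country": "Germany"}},
    {"id": "train_fr.jpg", "vector": [0.8, 0.2, 0.0], "metadata": {"country": "France"}},
    {"id": "horse_de.jpg", "vector": [0.1, 0.9, 0.0], "metadata": {"country": "Germany"}},
]

# Stand-in for the embedding of the text query "train".
TRAIN_QUERY = [1.0, 0.0, 0.0]

if __name__ == "__main__":
    # Object query "train" combined with metadata query country == "Germany".
    print(search(INDEX, TRAIN_QUERY, {"country": "Germany"}))
```

In this toy setting the metadata filter removes the French train before ranking, so the German train is returned first and the German horse image follows with a much lower similarity score.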