We address the problem of detecting all books in a collection of images and mapping them to entries in a given book catalogue. Instead of performing an independent retrieval for each detected book, we treat the image-text mapping problem as a many-to-many matching process, seeking the best overall match between the two sets. We combine a state-of-the-art segmentation method (SAM) to detect book spines with a commercial OCR engine to extract book information. We then propose a two-stage approach to text-image matching, in which CLIP embeddings are first used for fast matching, followed by a slower second stage that refines the matching using either the Hungarian Algorithm or a BERT-based model trained to cope with noisy OCR input and partial text matches. To evaluate our approach, we publish a new dataset of annotated bookshelf images that covers the whole book collection of a public library in Spain. In addition, we provide two target lists of book metadata: a closed set of 15k book titles corresponding to the known library inventory, and an open set of 2.3M book titles simulating an open-world scenario. We report results in two settings: a matching-only task, where the book segments and OCR output are given and the objective is to perform many-to-many matching against the target lists, and a combined detection and matching task, where books must first be detected and recognised before they are matched to the target list entries. We show that both Hungarian Matching and the proposed BERT-based model outperform a fuzzy string matching baseline, and we highlight inherent limitations of the matching algorithms as the target list grows in size and when either of the two sets (detected books or target book list) is incomplete. The dataset and code are available at https://github.com/llabres/library-dataset
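The global-matching stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes hypothetical precomputed embeddings for detected spines and catalogue entries, builds a cosine-similarity matrix, and solves the assignment with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`.

```python
# Hypothetical sketch of the global matching stage: assign each detected
# book spine to at most one catalogue entry so that the total cosine
# similarity is maximised (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_books(spine_embeddings: np.ndarray, catalogue_embeddings: np.ndarray):
    """Return (spine_index, catalogue_index) pairs for the best global match.

    Both inputs are (n, d) arrays of embeddings, e.g. from CLIP.
    """
    # Normalise rows so that dot products equal cosine similarities.
    s = spine_embeddings / np.linalg.norm(spine_embeddings, axis=1, keepdims=True)
    c = catalogue_embeddings / np.linalg.norm(catalogue_embeddings, axis=1, keepdims=True)
    similarity = s @ c.T
    # linear_sum_assignment minimises cost, so negate to maximise similarity.
    rows, cols = linear_sum_assignment(-similarity)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy usage with random stand-in embeddings: 3 detected spines,
# 4 catalogue entries, 8-dimensional vectors.
rng = np.random.default_rng(0)
spines = rng.normal(size=(3, 8))
catalogue = rng.normal(size=(4, 8))
pairs = match_books(spines, catalogue)
print(pairs)
```

Unlike per-spine nearest-neighbour retrieval, the assignment guarantees that no two spines claim the same catalogue entry, which is what makes the problem a many-to-many set matching rather than independent lookups.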