Linking Representations with Multimodal Contrastive Learning

Many applications require grouping instances contained in diverse document datasets into classes. Most widely used methods do not employ deep learning and do not exploit the inherently multimodal nature of documents. Notably, record linkage is typically conceptualized as a string-matching problem. This study develops CLIPPINGS (Contrastively Linking Pooled Pre-trained Embeddings), a multimodal framework for record linkage. CLIPPINGS employs end-to-end training of symmetric vision and language bi-encoders, aligned through contrastive language-image pre-training, to learn a metric space where the pooled image-text representation for a given instance is close to representations in the same class and distant from representations in different classes. At inference time, instances can be linked by retrieving their nearest neighbor from an offline exemplar embedding index or by clustering their representations. The study examines two challenging applications: constructing comprehensive supply chains for mid-20th century Japan by linking firm-level financial records (with each firm name represented by its crop in the document image and the corresponding OCR text), and detecting which image-caption pairs in a massive corpus of historical U.S. newspapers came from the same underlying photo wire source. CLIPPINGS outperforms widely used string-matching methods by a wide margin and also outperforms unimodal methods. Moreover, a purely self-supervised model trained only on image-OCR pairs also outperforms popular string-matching methods without requiring any labels.
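The inference step described above (pool the image and text embeddings of an instance, then retrieve its nearest neighbor from an exemplar index) can be illustrated with a minimal sketch. This is not the CLIPPINGS implementation. The pooling rule (averaging the two normalized embeddings), the function names `pool` and `link`, the `threshold` parameter, and the toy three-dimensional vectors are all assumptions made for illustration; in practice the embeddings would come from the trained vision and language encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def pool(image_emb, text_emb):
    """Assumed pooling rule: average the normalized image and text
    embeddings, then renormalize the result."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return l2_normalize([(a + b) / 2.0 for a, b in zip(img, txt)])


def cosine(u, v):
    """Cosine similarity of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def link(query, index, threshold=0.0):
    """Return (label, score) of the nearest exemplar in the index, or
    (None, score) if the best score falls below the hypothetical threshold."""
    best_label, best_score = None, -1.0
    for label, emb in index.items():
        score = cosine(query, emb)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Toy offline exemplar index: one pooled embedding per known class.
index = {
    "firm_a": pool([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "firm_b": pool([0.0, 1.0, 0.0], [0.1, 0.9, 0.0]),
}

# A new instance: embedding of its image crop and embedding of its noisy OCR.
query = pool([0.8, 0.2, 0.0], [1.0, 0.0, 0.1])
label, score = link(query, index)
```

In this toy setup the query is linked to `firm_a`, because its pooled embedding lies much closer to that exemplar than to `firm_b`. A brute-force loop is used only for readability; a large exemplar index would normally be searched with a dedicated nearest-neighbor library.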
