Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes, such as color, shape, and pattern, can benefit from image-based extraction. The visual modality has long been underutilized, mainly due to the difficulty of multimodal annotation. In this paper, we aim to patch the visual modality to the textual-established attribute information extractor. The cross-modality integration faces several unique challenges: (C1) images and textual descriptions are loosely paired, both intra-sample and inter-sample; (C2) images usually contain rich backgrounds that can mislead the prediction; (C3) weakly supervised labels from textual-established extractors are biased for multimodal training. We present PV2TEA, an encoder-decoder architecture equipped with three bias reduction schemes: (S1) augmented label-smoothed contrast, which improves the cross-modality alignment for loosely paired images and text; (S2) attention pruning, which adaptively distinguishes the visual foreground; (S3) two-level neighborhood regularization, which mitigates the textual bias in the labels via reliability estimation. Empirical results on real-world e-commerce datasets demonstrate an F1 increase of up to 11.74% absolute (20.97% relative) over unimodal baselines.
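To make the idea behind (S1) concrete, the following is a minimal sketch of a label-smoothed contrastive objective over a batch of image-text pairs. It is not the PV2TEA implementation: the function names, the smoothing weight `eps`, and the single image-to-text direction are all assumptions made for illustration, and the augmentation component of S1 is omitted. The matched pair sits on the diagonal of the similarity matrix, and instead of a one-hot target, a small amount of probability mass is spread over the other texts in the batch, which softens the penalty when a nominally "negative" text also describes the image.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def smoothed_contrastive_loss(sim, eps=0.1):
    """Image-to-text cross-entropy with label-smoothed targets.

    sim[i][j] is the similarity between image i and text j; the matched
    pair is on the diagonal. eps is a hypothetical smoothing weight:
    the target for the matched text is 1 - eps, and the remaining eps
    is shared equally among the other texts in the batch.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two image-text pairs")
    off_diagonal = eps / (n - 1)
    total = 0.0
    for i, row in enumerate(sim):
        logp = log_softmax(row)
        for j in range(n):
            target = 1.0 - eps if i == j else off_diagonal
            total -= target * logp[j]
    return total / n


if __name__ == "__main__":
    # Toy similarities for two pairs; the values are made up.
    sim = [[5.0, 0.0], [0.0, 5.0]]
    print(smoothed_contrastive_loss(sim, eps=0.0))
    print(smoothed_contrastive_loss(sim, eps=0.1))
```

With `eps=0.0` this reduces to the standard one-hot contrastive cross-entropy. A positive `eps` raises the loss for very confident predictions, so the model is discouraged from pushing loosely paired texts arbitrarily far from an image.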