We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval-augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling on queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective, without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits of the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%, and the method also outperforms prior work on the unseen and query splits by a substantial double-digit margin.
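The idea of letting retrieved candidates "remove invalid decoding paths" can be illustrated with a prefix-trie over candidate strings: at each step, the decoder is only allowed to emit tokens that extend some retrieved candidate. The sketch below is a minimal, self-contained illustration of this general technique, not the paper's implementation; the toy candidates, scoring function, and `<eos>` marker are illustrative assumptions.

```python
# Sketch of trie-constrained decoding (illustrative, not AutoVER's code):
# retrieved candidate entity names are compiled into a prefix trie, and
# generation may only follow paths that remain inside the trie.

END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(candidates):
    """Compile tokenized candidates into a nested-dict prefix trie."""
    trie = {}
    for tokens in candidates:
        node = trie
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return trie


def allowed_tokens(trie, prefix):
    """Return the set of tokens that keep `prefix` on a valid trie path."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()  # prefix has fallen off every candidate
        node = node[tok]
    return set(node)


def constrained_greedy_decode(trie, score):
    """Greedy decoding restricted to the trie; `score` stands in for
    the language model's next-token scores."""
    prefix = []
    while True:
        options = allowed_tokens(trie, prefix)
        if not options:
            break
        best = max(options, key=lambda t: score(prefix, t))
        if best == END:
            break
        prefix.append(best)
    return prefix


# Toy usage: three similar entities; the (assumed) scores prefer "golden".
cands = [["golden", "retriever"], ["golden", "eagle"], ["bald", "eagle"]]
trie = build_trie(cands)
prefs = {"golden": 2.0, "retriever": 1.5, "eagle": 1.0, "bald": 0.5, END: 0.0}
result = constrained_greedy_decode(trie, lambda p, t: prefs[t])
# result is guaranteed to be a prefix of one of the retrieved candidates
```

Even if an unconstrained model would prefer an invalid continuation, masking to `allowed_tokens` guarantees the output is one of the retrieved entity names, which is what makes the retrieved list an explicit guide for generation.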