Retrievers based on Vision-Language Models (VLMs) have substantially improved the quality of visual document retrieval (VDR). However, they use the same multi-billion-parameter encoder for both document indexing and query encoding, which incurs high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are short text strings. NanoVDR exploits this query--document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student with as few as 69M parameters encodes queries at inference time. The key design choice is the distillation objective. Through a systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck and resolve it cheaply by augmenting the training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1\% of teacher quality and outperforms DSE-Qwen2 (2B) on ViDoRe v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.
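To make the distillation objective concrete, the following is a minimal sketch of pointwise cosine alignment and of the asymmetric scoring it enables. It is not the authors' implementation: the exact loss form (the batch mean of $1 - \cos$), the function names, and the plain-list vector representation are illustrative assumptions. It also assumes the student already emits vectors of the same dimension as the teacher.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pointwise_cosine_loss(
    student_queries: Sequence[Vector],
    teacher_queries: Sequence[Vector],
) -> float:
    """Mean of 1 - cos(student_i, teacher_i) over a batch of queries.

    Assumed loss form. Each student embedding is compared only with the
    pre-cached teacher embedding of the same query text, so no document
    embeddings and no in-batch negatives are involved.
    """
    if not student_queries or len(student_queries) != len(teacher_queries):
        raise ValueError("batches must be non-empty and of equal size")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_queries, teacher_queries)
    )
    return total / len(student_queries)


def rank_documents(
    query_embedding: Vector,
    document_embeddings: Sequence[Vector],
) -> List[int]:
    """Document indices sorted by decreasing cosine similarity to the query.

    At inference the query vector would come from the small text-only
    student, while the document vectors come from the teacher's offline index.
    """
    scores = [cosine_similarity(query_embedding, d) for d in document_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Because the loss touches only (student, teacher) embedding pairs for the same query, the teacher's query embeddings can be computed once and cached, which is consistent with the abstract's claim that training needs no document processing.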