Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive computational costs hinder practical edge deployment. To address this, we propose a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings. By jointly optimizing instance-level alignment and class-level semantic consistency, our approach anchors visual embeddings to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles. Experiments show that our method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks. We further conduct explicit cross-lingual retrieval, where the query language differs from the target language, to validate the effectiveness of the learned cross-lingual representations. Achieving strong performance with only a fraction of the parameters required by existing models, our framework enables accurate and resource-efficient cross-script handwriting retrieval.
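The joint objective described above combines instance-level alignment with class-level semantic consistency. The sketch below illustrates one plausible form of such a loss: an InfoNCE term that aligns each visual embedding with its paired text embedding, plus a term pulling embeddings toward language-agnostic class prototypes. All function names, the temperature, and the weighting factor are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # project embeddings onto the unit hypersphere
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(v, t, tau=0.07):
    # instance-level alignment: visual embedding v[i] should score highest
    # against its paired text embedding t[i] among all pairs in the batch
    logits = v @ t.T / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def prototype_loss(v, labels, prototypes):
    # class-level consistency: pull each embedding toward the
    # language-agnostic prototype of its word class (cosine distance)
    return np.mean(1.0 - np.sum(v * prototypes[labels], axis=1))

def joint_loss(v, t, labels, prototypes, lam=0.5):
    # lam is an assumed weighting between the two terms
    v, t = l2_normalize(v), l2_normalize(t)
    prototypes = l2_normalize(prototypes)
    return info_nce(v, t) + lam * prototype_loss(v, labels, prototypes)

# hypothetical usage with a random mini-batch
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))                  # visual embeddings
t = rng.normal(size=(4, 8))                  # text embeddings
prototypes = rng.normal(size=(3, 8))         # one prototype per word class
labels = np.array([0, 1, 2, 0])              # class index of each sample
loss = joint_loss(v, t, labels, prototypes)
```

Under this formulation, the contrastive term enforces retrieval-style discrimination within a batch, while the prototype term is what enforces invariance across scripts and writing styles: samples of the same word, regardless of language or handwriting, are anchored to a single shared prototype.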