Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.