Braille plays a vital role in education and information access for visually impaired individuals. However, Braille information processing faces challenges such as data scarcity and ambiguity in mixed-text contexts. We construct English and Chinese Braille Mixed Datasets (EBMD/CBMD) containing mathematical formulas to support diverse Braille research, and propose a syntax tree-based augmentation method tailored to Braille data. To address the underperformance of traditional fine-tuning on Braille-related tasks, we investigate Braille Knowledge-Based Fine-Tuning (BKFT), which reduces the difficulty of learning Braille contextual features. Building on this, we present BrailleLLM, which applies BKFT via instruction tuning to achieve unified Braille translation, formula-to-Braille conversion, and mixed-text translation. Experiments show that BKFT yields significant performance improvements over conventional fine-tuning in Braille translation scenarios. Our open-sourced datasets and methods establish a foundation for low-resource multilingual Braille research.