Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Although unvocalized Hebrew is highly ambiguous, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DiVRit, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level: for each undiacritized word, it selects the most appropriate diacritization pattern from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DiVRit is its use of a Hebrew Visual Language Model that processes diacritized candidates as images, so that diacritic information is embedded directly in their vector representations, while the surrounding context remains tokenization-based. Through a comprehensive evaluation across various configurations, we demonstrate that the system performs diacritization effectively without relying on complex, explicit linguistic analysis. Notably, in an ``oracle'' setting, where the correct diacritized form is guaranteed to be among the provided candidates, DiVRit achieves a high level of accuracy. Furthermore, architectural enhancements and optimized training methodologies yield significant improvements in the system's generalization. These findings highlight the potential of visual representations for accurate and automated Hebrew diacritization.
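The word-level selection step described above can be illustrated with a minimal sketch. It is not the DiVRit implementation: the names `cosine` and `select_candidate` are hypothetical, the hand-written three-dimensional vectors stand in for the embeddings that a visual encoder (for candidate images) and a token-based encoder (for the context) would produce, and cosine similarity is an assumed scoring function that the abstract does not specify.

```python
import math
from typing import Dict, Sequence

# Three diacritized readings of the undiacritized word "ספר" (toy example).
SEFER = "\u05e1\u05b5\u05e4\u05b6\u05e8"        # "book"
SAPAR = "\u05e1\u05b7\u05e4\u05bc\u05b8\u05e8"  # "barber"
SAFAR = "\u05e1\u05b8\u05e4\u05b7\u05e8"        # "counted"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidate(
    context_vec: Sequence[float],
    candidate_vecs: Dict[str, Sequence[float]],
) -> str:
    """Return the candidate whose embedding is most similar to the context.

    Assumption: scoring by cosine similarity; the real system may score
    candidates differently.
    """
    if not candidate_vecs:
        raise ValueError("candidate set is empty")
    return max(candidate_vecs, key=lambda c: cosine(context_vec, candidate_vecs[c]))


# Toy stand-ins for encoder outputs; in a real pipeline these would come
# from the candidate-image encoder and the context encoder.
demo_candidates = {
    SEFER: [1.0, 0.0, 0.1],
    SAPAR: [0.0, 1.0, 0.1],
    SAFAR: [0.1, 0.2, 1.0],
}
demo_context = [0.9, 0.1, 0.2]  # a context that points toward "book"

best = select_candidate(demo_context, demo_candidates)
```

In this toy setup the correct form is always present in `demo_candidates`, which mirrors the ``oracle'' setting; with a generated candidate set, the selection can only be as good as the candidates it is given.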