Untranslatability, the set of cases in which meaning cannot be directly preserved across languages, is well studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their remaining limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies: specific techniques for conveying meaning when no direct equivalent exists. We operationalize this framework as a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.