The impact of machine translation (MT) on low-resource languages remains poorly understood. In particular, observational studies of actual usage patterns are scarce. Such studies could provide valuable insights into user needs and behaviours, complementing survey-based methods. Here we present an observational analysis of real-world MT usage for Tetun, the lingua franca of Timor-Leste, using server logs from a widely used MT service with over 70,000 monthly active users. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate short texts into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for languages like Tetun should prioritise translating into the low-resource language, handling brief inputs effectively, and covering a wide range of domains relevant to educational contexts. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development by grounding research in practical community needs.