Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.