The prevalence of rapidly evolving slang, neologisms, and highly stylized expressions in informal user-generated text, particularly on Chinese social media, poses significant challenges for Machine Translation (MT) benchmarking. Specifically, we identify two primary obstacles: (1) data scarcity, as high-quality parallel data requires bilingual annotators familiar with platform-specific slang and with stylistic cues in both languages; and (2) metric limitations, where traditional evaluators such as COMET often fail to capture stylistic fidelity and nonstandard expressions. To bridge these gaps, we introduce CSM-MTBench, a benchmark covering five Chinese-foreign language directions and consisting of two expert-curated subsets: Fun Posts, featuring context-rich, slang- and neologism-heavy content, and Social Snippets, emphasizing concise, emotion- and style-driven expressions. Furthermore, we propose tailored evaluation approaches for each subset: measuring the translation success rate of slang and neologisms in Fun Posts, and assessing tone and style preservation in Social Snippets via a hybrid of embedding-based metrics and LLM-as-a-judge. Experiments on over 20 models reveal substantial variation in how current MT systems handle semantic fidelity and informal, social-media-specific stylistic cues. CSM-MTBench thus serves as a rigorous testbed for advancing MT systems capable of mastering real-world Chinese social media text.