We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
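As an illustration of the tokenization-parity measurement mentioned above, the following is a minimal sketch. It assumes parity is computed as the mean ratio of token counts between a sentence and its parallel counterpart in a reference language; the abstract does not state the exact definition used. The function name `tokenization_parity` and the whitespace tokenizer are hypothetical stand-ins, and in practice `tokenize` would be the tokenizer of the model under evaluation.

```python
from typing import Callable, Sequence


def tokenization_parity(
    tokenize: Callable[[str], Sequence],
    target_sentences: Sequence[str],
    reference_sentences: Sequence[str],
) -> float:
    """Mean per-sentence ratio of token counts (target / reference).

    Assumption: the two lists are parallel, i.e. sentence i in one list
    is the translation of sentence i in the other. A value above 1.0
    means the target language needs more tokens for the same content.
    """
    if len(target_sentences) != len(reference_sentences):
        raise ValueError("sentence lists must have the same length")
    if not target_sentences:
        raise ValueError("at least one sentence pair is required")

    ratios = []
    for target, reference in zip(target_sentences, reference_sentences):
        reference_len = len(tokenize(reference))
        if reference_len == 0:
            raise ValueError("reference sentence produced no tokens")
        ratios.append(len(tokenize(target)) / reference_len)
    return sum(ratios) / len(ratios)


def whitespace_tokenize(text: str) -> list:
    """Hypothetical stand-in tokenizer: splits on whitespace only."""
    return text.split()
```

With a real subword tokenizer in place of `whitespace_tokenize`, a result noticeably above 1.0 for Piedmontese against Italian would correspond to the tokenization penalty described in the abstract.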