Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.