Instruction tuning · Low-resource · Language models · Learning models · Uncertainty

Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

Jerry Huang, Peng Lu, Qiuhao Zeng, Yusuke Iwasawa, Yutaka Matsuo, Sarath Chandar, Edison Marrese-Taylor, Irene Li

From arXiv. Accepted to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential to maintaining their trustworthiness and reliability. Yet despite continuing advances in foundation model research, the relationship between large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how data scarcity can lead to different calibration effects and how commonly used techniques apply in these settings. Our analysis on two multilingual benchmarks, covering 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration and highlighting a critical shortcoming of standard SFT in multilingual settings. Furthermore, we observe that label smoothing is a reasonable method to alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
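To make the mis-calibration described in the abstract concrete, the following is a minimal sketch of expected calibration error (ECE), a commonly used binned calibration metric. The abstract does not say which metric the authors use, so ECE is an assumption here, and the function name `expected_calibration_error`, the equal-width binning, and the toy numbers are all hypothetical illustrations rather than the paper's setup or results.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of confidences falling in each bin
    acc_sum = [0.0] * n_bins   # number of correct predictions in each bin
    count = [0] * n_bins       # number of predictions in each bin

    for c, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes to the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue
        avg_conf = conf_sum[k] / count[k]
        avg_acc = acc_sum[k] / count[k]
        ece += (count[k] / n) * abs(avg_acc - avg_conf)
    return ece


# Hypothetical toy data: accuracy stays at 50% while confidence rises.
outcomes = [True, False, True, False]
ece_before = expected_calibration_error([0.5] * 4, outcomes)
ece_after = expected_calibration_error([0.9] * 4, outcomes)
print(f"ECE before: {ece_before:.2f}, ECE after: {ece_after:.2f}")
```

In this toy case accuracy is unchanged, so the whole increase in ECE comes from the rise in confidence, which is the pattern the abstract reports for low-resource languages after SFT.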

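As an illustration of the label smoothing mentioned in the abstract, here is a minimal sketch of the standard uniform-mixture form, in which the one-hot target is replaced by (1 - epsilon) times the one-hot vector plus epsilon / K spread over all K classes. The abstract does not give the authors' formulation or hyperparameters, so the value `epsilon=0.1`, the helper names `smoothed_targets` and `cross_entropy`, and the example logits are assumptions made only for illustration.

```python
import math
from typing import List, Sequence


def smoothed_targets(num_classes: int, gold: int, epsilon: float = 0.1) -> List[float]:
    """Mix a one-hot target with the uniform distribution over all classes."""
    targets = [epsilon / num_classes] * num_classes
    targets[gold] += 1.0 - epsilon
    return targets


def cross_entropy(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))


# Hypothetical logits for a 4-way choice where class 0 is the gold answer.
very_confident = [10.0, 0.0, 0.0, 0.0]
moderately_confident = [3.0, 0.0, 0.0, 0.0]

hard = smoothed_targets(4, gold=0, epsilon=0.0)  # plain one-hot target
soft = smoothed_targets(4, gold=0, epsilon=0.1)

for name, logits in [("very", very_confident), ("moderate", moderately_confident)]:
    print(
        f"{name:>8}: hard-target loss = {cross_entropy(logits, hard):.4f}, "
        f"smoothed loss = {cross_entropy(logits, soft):.4f}"
    )
```

With hard targets the loss keeps falling as the gold logit grows, which rewards ever higher confidence. With smoothed targets the very confident prediction is penalized relative to the moderate one, which is one intuition for why label smoothing can temper over-confidence.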
