In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral and/or brain data from humans performing the same task. In particular, we ask whether fine-tuning LLMs on fMRI data of humans performing moral reasoning can improve their BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral-reasoning behavioral data from the ETHICS benchmark [Hendrycks et al., 2020], on moral-reasoning fMRI data from Koster-Hale et al. [2013], or on both. We evaluate both accuracy on the ETHICS benchmark and the BrainScore between model activations and the fMRI data. While larger models generally perform better on both metrics, BrainScores do not significantly improve after fine-tuning.
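A BrainScore of this kind is commonly computed as the cross-validated linear predictivity of fMRI responses from model activations. The sketch below is a minimal illustration of that general recipe, not the authors' exact pipeline; the `brainscore` helper, the ridge penalty, and all array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def brainscore(activations: np.ndarray, fmri: np.ndarray,
               n_splits: int = 5, alpha: float = 1.0) -> float:
    """Cross-validated linear predictivity of fMRI voxels from model activations.

    activations: (n_stimuli, n_features) model representations of the stimuli.
    fmri:        (n_stimuli, n_voxels) measured brain responses to the same stimuli.
    Returns the mean per-voxel Pearson correlation between predicted and held-out
    fMRI responses, averaged over cross-validation folds.
    """
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train, test in kf.split(activations):
        # Fit a ridge map from activations to voxel responses on the training fold.
        reg = Ridge(alpha=alpha).fit(activations[train], fmri[train])
        pred = reg.predict(activations[test])
        # Pearson r per voxel on the held-out fold, then averaged across voxels.
        r = [np.corrcoef(pred[:, v], fmri[test][:, v])[0, 1]
             for v in range(fmri.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))


# Toy example: 100 stimuli, 64-dim activations, 20 voxels generated by a
# linear map plus noise, so the score should be high.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
W = rng.normal(size=(64, 20))
Y = X @ W + 0.1 * rng.normal(size=(100, 20))
score = brainscore(X, Y)
```

On synthetic data with a near-linear activation-to-voxel relationship, the score approaches 1; for real fMRI data it is typically much lower and is compared against a noise ceiling.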