Fine-tuning large language models (LLMs) on domain-specific instructions has emerged as an effective way to improve their understanding of a target domain. Yet little work examines the core characteristics acquired during this process. In this study, we benchmark the fundamental characteristics learned by contact-center (CC) specific instruction fine-tuned LLMs against out-of-the-box (OOB) LLMs via probing tasks covering conversational, channel, and automatic speech recognition (ASR) properties. We explore different LLM architectures (Flan-T5 and Llama), sizes (3B, 7B, 11B, 13B), and fine-tuning paradigms (full fine-tuning vs. PEFT). Our findings show that CC-LLMs are remarkably effective on in-domain downstream tasks, improving response acceptability by over 48% compared to OOB-LLMs. Additionally, we compare the performance of OOB-LLMs and CC-LLMs on the widely used SentEval dataset, and assess how well they capture surface, syntactic, and semantic information through probing tasks. Intriguingly, probing classifiers perform relatively consistently across the set of probing tasks. Our observations indicate that CC-LLMs, while outperforming their out-of-the-box counterparts, tend to rely less on encoding surface, syntactic, and semantic properties. This highlights the intricate interplay between domain-specific adaptation and probing task performance, and opens up opportunities to explore the behavior of fine-tuned language models in specialized contexts.