Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving it remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), degrading agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior: a low F1 score for dialog acts (0.464), excessive and often misaligned tool usage (F1 score: 0.139), and ineffective use of external knowledge. Reducing such behavior gaps yields significant performance improvements (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies for enhancing the effectiveness of LLM-based TODS on complex tasks.