Sycophancy in LLMs is documented across 70+ papers, but expert agreement on the construct's boundaries remains low (ICC = .184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: the conversation is a test specimen under load, the LLM is a material charge, pushback is a progressive load, and a stance-flip is material failure. We characterize this failure across three loading cases (debate, n=1000; false-presuppositions, n=3400; ethical-setting, n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled (an analog of $\sigma = E \cdot \varepsilon$) and reproduce across loading cases, with effects up to $|r_{rb}| = 0.35$ on debate. The sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), whereas false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides). The ratios (2.03 vs 0.13/0.17) are estimator-dependent; for debate, even the direction depends on the estimator. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows that debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$), a caveat that single-judge benchmarks must report. This is the methodological move that Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.
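For readers unfamiliar with the cross-judge statistic, the following is a minimal sketch of how Cohen's $\kappa$ can be computed from two judges' labels on the same specimens. It is not the authors' pipeline: the function name `cohens_kappa` and the `"flip"`/`"hold"` label vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's marginals.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items on which the two raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies,
    # summed over every label either rater used.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels from two judges on four specimens (illustrative only).
judge_a = ["flip", "flip", "hold", "hold"]
judge_b = ["flip", "hold", "hold", "hold"]
kappa = cohens_kappa(judge_a, judge_b)
```

Because $\kappa$ discounts chance agreement, two judges can agree on most specimens and still produce a low value when one label dominates. This is one reason a per-case $\kappa$ is more informative than raw agreement.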