Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC = 0.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: the conversation is a test specimen under load, the LLM is a material charge, pushback is a progressive load, and a stance flip is material failure. We characterize this failure across three loading cases (debate, n = 1000; false-presuppositions, n = 3400; ethical-setting, n = 3400; 10 to 17 material charges per case; 7800 specimens total) using 14 turn-level axis measurements spanning velocity, damage accumulation, frame drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled (an analog of $\sigma = E \cdot \varepsilon$) and reproduce across loading cases, with effects up to $|r_{rb}| = 0.35$ on debate. The sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), whereas false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides). The variance ratios (2.03 vs. 0.13 and 0.17) are estimator-dependent, and for debate even the direction of the ratio depends on the estimator. Cross-judge reliability (GPT-4o vs. Haiku 4.5) shows that debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$), a caveat that single-judge benchmarks must report. This is the methodological move that Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.
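
For readers unfamiliar with the cross-judge statistic, the following is a minimal sketch of how Cohen's $\kappa$ can be computed for two judges labelling the same specimens. It is an illustration only, not the pipeline used in this work: the function name `cohens_kappa`, the label strings `"flip"` and `"hold"`, and the toy data are all assumptions introduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of marginals.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    # Degenerate case: both raters used one identical category throughout.
    if p_e == 1.0:
        return 1.0

    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data: two judges scoring ten specimens.
judge_a = ["flip", "flip", "hold", "hold", "hold",
           "flip", "hold", "hold", "hold", "hold"]
judge_b = ["flip", "flip", "hold", "hold", "hold",
           "hold", "hold", "hold", "hold", "hold"]

kappa = cohens_kappa(judge_a, judge_b)
```

In this toy example the judges agree on nine of ten specimens, yet $\kappa$ is well below 0.9 because much of that agreement is expected by chance given how often both judges assign `"hold"`. This is why chance-corrected agreement is typically reported instead of raw agreement.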
