Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's kappa = 0.835) spanning three evasion levels. Our key contribution is a multi-model annotation framework built on a core insight: disagreement between frontier LLMs signals the hard examples most valuable for training. We mine boundary cases where two strong annotators conflict and use a judge model to resolve their labels. This approach outperforms single-model distillation by 2.4%, and the judge-resolved samples improve generalization despite incurring higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3% accuracy, outperforming its base model by 25 percentage points and approaching frontier-LLM performance at a fraction of the inference cost.
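To make the annotation pipeline concrete, the following is a minimal Python sketch of the disagreement-mining step under stated assumptions: the wrappers annotator_a, annotator_b, and judge are hypothetical stand-ins for calls to the two frontier-LLM annotators and the judge model, and the three label names are placeholders, since the abstract does not name the evasion levels.

```python
# Minimal sketch of disagreement mining with judge resolution, as described
# in the abstract. Annotator/judge wrappers are hypothetical; substitute
# your own LLM client calls.

from typing import Callable, List, Tuple

# Placeholder names for the three evasion levels (not specified in the abstract).
EVASION_LEVELS = ["direct", "partially_evasive", "fully_evasive"]

def mine_disagreements(
    samples: List[str],
    annotator_a: Callable[[str], str],      # frontier LLM annotator #1
    annotator_b: Callable[[str], str],      # frontier LLM annotator #2
    judge: Callable[[str, str, str], str],  # resolves conflicting labels
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split samples into consensus-labeled pairs and judge-resolved pairs."""
    consensus, resolved = [], []
    for text in samples:
        label_a = annotator_a(text)
        label_b = annotator_b(text)
        if label_a == label_b:
            # Annotators agree: take the consensus label directly.
            consensus.append((text, label_a))
        else:
            # Boundary case: the two strong annotators conflict,
            # so the judge model decides the final label.
            resolved.append((text, judge(text, label_a, label_b)))
    return consensus, resolved
```

On the abstract's description, the consensus pairs and the judge-resolved boundary cases together form the training set, with the judge-resolved samples supplying the implicit regularization effect noted above.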