Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups such as inter-model monitoring and distillation into smaller models. When agents at different capability tiers must cooperate, strong models need to produce traces that weaker models can digest. We refer to this goal as "weak-to-strong legibility". The trustworthiness of large models depends in part on this property. For safety oversight in particular, weak monitors may become a standard component of reliability scaffolds that must operate within a reasonable budget. Legibility requires that these decision-making traces take a form accessible to weaker monitors. Existing efficiency-based metrics for legibility focus on conciseness and fail to capture "thoroughness".