Althea: Human-AI Collaboration for Fact-Checking and Critical Reasoning

The web's information ecosystem demands fact-checking systems that are both scalable and epistemically trustworthy. Automated approaches offer efficiency but often lack transparency, while human verification remains slow and inconsistent. We introduce Althea, a retrieval-augmented system that integrates question generation, evidence retrieval, and structured reasoning to support user-driven evaluation of online claims. On the AVeriTeC benchmark, Althea achieves a Macro-F1 of 0.44, outperforming standard verification pipelines and improving discrimination between supported and refuted claims. We further evaluate Althea through a controlled user study and a longitudinal survey experiment (N=963), comparing three interaction modes that vary in the degree of scaffolding: an Exploratory mode with guided reasoning, a Summary mode providing synthesized verdicts, and a Self-search mode that offers procedural guidance without algorithmic intervention. Results show that guided interaction produces the strongest immediate gains in accuracy and confidence, while self-directed search yields the most persistent improvements over time. This pattern suggests that performance gains are not driven solely by effort or exposure, but by how cognitive work is structured and internalized. Participants consistently described Althea as transparent and supportive of reflective reasoning, emphasizing its ability to organize evidence and clarify competing claims. By integrating retrieval, interaction, and pedagogical scaffolding, Althea demonstrates how human-AI interaction can move beyond automated verdicts toward durable improvements in reasoning. These findings advance the design of trustworthy, human-centered fact-checking systems that balance guidance with epistemic autonomy.
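For readers unfamiliar with the Macro-F1 metric reported above, the following is a minimal sketch of how such a score is typically computed: the per-class F1 is calculated for each verdict label and then averaged with equal weight, so rare classes count as much as frequent ones. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of Althea or of the official AVeriTeC scoring code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred.

    Hypothetical helper for illustration only; not the benchmark's scorer.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up predictions (not data from the paper).
gold = ["Supported", "Refuted", "Refuted", "Not Enough Evidence"]
pred = ["Supported", "Refuted", "Supported", "Not Enough Evidence"]
print(round(macro_f1(gold, pred), 4))
```

Because each class contributes equally to the average, a system that confuses supported and refuted claims is penalized even if one of those classes dominates the data, which is why the metric is sensitive to the discrimination improvement the abstract describes.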
