As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that applies swarm intelligence principles and distributed pairwise ranking consensus to improve the quality of AI inference. Our approach reimagines collaboration among AI nodes as swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set, an improvement of +17.21 percentage points (approximately +25.1% relative). The protocol incorporates on-chain reputation so that node influence adapts to demonstrated accuracy over time, yielding a meritocratic consensus that filters out low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, which makes multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy and strong resilience to adversarial and noisy free-form prompting (e.g., prompt-injection degradation of only 0.12% versus 6.20% for a monolithic single-model baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems that democratize access to high-quality inference through collective intelligence without sacrificing reliability or security.
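To make the contrast between majority voting and reputation-weighted pairwise ranking concrete, the following is a minimal sketch and not the Fortytwo implementation. The abstract describes a custom Bradley-Terry-style aggregation model without giving its details, so this example substitutes the standard Bradley-Terry model fitted with the classic minorization-maximization (MM) update. The function names, the `prior` smoothing term, the idea of using a judge's reputation directly as a comparison weight, and all input data are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Baseline: return the most frequent answer among the nodes."""
    return Counter(answers).most_common(1)[0][0]


def bradley_terry_strengths(comparisons, n_items, iters=200, prior=0.1):
    """Fit standard Bradley-Terry strengths with MM updates.

    comparisons: list of (winner, loser, weight) tuples, where `weight`
    stands in for the reputation of the judging node (an assumption of
    this sketch). `prior` adds a small pseudo-win in every direction so
    that each strength stays strictly positive.
    """
    # wins[i][j] = total (weighted) evidence that item i beats item j.
    wins = [[0.0 if i == j else prior for j in range(n_items)]
            for i in range(n_items)]
    for winner, loser, weight in comparisons:
        wins[winner][loser] += weight

    strengths = [1.0 / n_items] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n_items) if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        strengths = [s / norm for s in updated]  # keep strengths summing to 1
    return strengths


if __name__ == "__main__":
    # Hypothetical round: three candidate responses, two of which agree.
    answers = ["A", "A", "B"]
    print("majority vote picks:", majority_vote(answers))

    # Hypothetical pairwise judgments: (winner index, loser index, judge reputation).
    judgments = [
        (2, 0, 0.9),
        (2, 1, 0.9),
        (2, 0, 0.8),
        (0, 1, 0.5),
        (1, 2, 0.2),
    ]
    strengths = bradley_terry_strengths(judgments, n_items=len(answers))
    best = max(range(len(answers)), key=lambda i: strengths[i])
    print("pairwise ranking picks response", best, "->", answers[best])
```

In this made-up round, majority voting returns the answer that two nodes share, while the pairwise aggregation can prefer the minority response when high-reputation judges consistently rank it above the others. How Fortytwo actually weights judges, updates reputation, or selects comparison pairs is not stated in the abstract and is not modeled here.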