Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or whether superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we demonstrate the prevalence of "hidden gems": unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, exhaustively evaluating every uploaded model to discover these gems is computationally infeasible. We therefore formulate model discovery as a multi-armed bandit problem and accelerate the Sequential Halving search algorithm with shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.
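The search procedure can be illustrated with a minimal sketch of Sequential Halving: the per-candidate query budget is split across roughly log2(n) elimination rounds, and in each round every surviving candidate is scored on the same shared query batch before the bottom half is dropped. The `evaluate` callback and the synthetic query batch here are hypothetical stand-ins, not the paper's actual benchmark harness.

```python
import math
import random

def sequential_halving(candidates, evaluate, budget_per_candidate=50):
    """Return the best candidate found under a fixed per-candidate
    query budget, using halving-style elimination rounds."""
    survivors = list(candidates)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    queries_per_round = max(1, budget_per_candidate // rounds)
    while len(survivors) > 1:
        # Shared query set: all survivors are scored on the same batch,
        # so per-round scores are directly comparable across candidates.
        batch = [random.random() for _ in range(queries_per_round)]
        scores = {c: evaluate(c, batch) for c in survivors}
        # Aggressive elimination: keep only the top half each round.
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]
```

With a noiseless evaluator (e.g. `lambda c, batch: c` on candidates that are their own true scores), the sketch returns the best candidate after about log2(n) rounds while querying each candidate only `budget_per_candidate` times in total.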