Consider a marketplace of AI tools, each with slightly different strengths and weaknesses. By picking the right model for the task at hand, a user can do better than by using the same model for everything. Routers operate on a similar principle: sophisticated model selection can increase overall performance. However, this aggregation across models is often noisy, reflecting imperfect user choices or routing decisions. This raises two main questions. First, what does a "healthy marketplace" of models look like for maximizing consumer utility? Second, how can we incentivize producers to create such models? We show that winrate, a standard benchmark in LLM evaluation, can incentivize model creators to homogenize under both types of model changes we consider, reducing consumer welfare. We propose a new mechanism, weighted winrate, which rewards models for higher-quality answers, and show that it provably improves producers' incentives to specialize and increases consumer welfare. We conclude by exploring the impact of our theoretical results on empirical benchmark datasets and discussing implications for benchmark design.