The dominance of proprietary LLMs has restricted access and raised information privacy concerns. High-performing open-source alternatives are crucial for information-sensitive and high-volume applications but often lag behind in performance. To address this gap, we propose (1) an untargeted variant of iterative self-critique and self-refinement that is free of external influence, and (2) a novel ranking metric, the Performance, Refinement, and Inference Cost Score (PeRFICS), which identifies the optimal model for a given task by weighing refined performance against cost. Our experiments show that state-of-the-art open-source models ranging from 7B to 65B parameters improve by 8.2% on average over their baseline performance. Strikingly, even models with extremely small memory footprints, such as Vicuna-7B, show an 11.74% improvement overall and up to a 25.39% improvement on high-creativity, open-ended tasks on the Vicuna benchmark. Vicuna-13B goes a step further and outperforms ChatGPT after refinement. This work has profound implications for resource-constrained and information-sensitive environments that seek to leverage LLMs without incurring prohibitive costs or compromising on performance and privacy. The domain-agnostic self-refinement process, coupled with our novel ranking metric, facilitates informed decision-making in model selection, thereby reducing costs and democratizing access to high-performing language models, as evidenced by case studies.
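The iterative self-critique and self-refinement idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `self_refine`, the `generate` callable (standing in for any text-in, text-out model), the prompt wording, and the default of two rounds are all assumptions made for illustration. "Untargeted" is read here as asking for general feedback without naming specific aspects to critique, and "devoid of external influence" as using the same model for drafting, critiquing, and refining.

```python
from typing import Callable, List


def self_refine(generate: Callable[[str], str], task: str, rounds: int = 2) -> List[str]:
    """Return the initial draft followed by one refined draft per round.

    `generate` is a hypothetical wrapper around a single model; no second
    model, human feedback, or external scorer is consulted.
    """
    drafts = [generate(task)]
    for _ in range(rounds):
        current = drafts[-1]
        # Untargeted critique (assumed prompt): ask for general weaknesses
        # without steering the model toward any particular aspect.
        critique = generate(
            f"Task:\n{task}\n\nResponse:\n{current}\n\n"
            "Critique this response and point out any weaknesses."
        )
        # Refinement (assumed prompt): the same model rewrites its own answer
        # using only its own critique.
        refined = generate(
            f"Task:\n{task}\n\nResponse:\n{current}\n\n"
            f"Critique:\n{critique}\n\n"
            "Rewrite the response so that it addresses the critique."
        )
        drafts.append(refined)
    return drafts
```

Each round costs two additional model calls (one critique, one rewrite), which is the kind of extra inference cost a metric such as PeRFICS would have to weigh against the gain in refined performance.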