Understanding the quality of an evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, since metrics often excel along one particular dimension but not across all of them. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across a range of multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences, is highly extensible, and can be integrated easily into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics better represent human judgment across diverse contexts.
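To make the core idea concrete, the following is a minimal, hypothetical sketch of calibrating a meta-metric: given per-example scores from two existing metrics and matching human preference ratings, search for the mixing weight whose weighted combination correlates best with the human scores. The toy data, the linear combination, and the grid search are illustrative assumptions only; the actual MetaMetrics optimization procedure may differ.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def calibrate(metric_a, metric_b, human):
    """Grid-search a weight w so that w*A + (1-w)*B best tracks human scores.

    A stand-in for supervised calibration; real systems may instead use
    Bayesian optimization or regression over many metrics.
    """
    best_w, best_r = 0.0, float("-inf")
    for step in range(101):
        w = step / 100
        combined = [w * a + (1 - w) * b for a, b in zip(metric_a, metric_b)]
        r = pearson(combined, human)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r

# Toy data (assumed): metric A tracks the human ratings well, metric B is noisy.
metric_a = [0.9, 0.7, 0.4, 0.2, 0.8]
metric_b = [0.5, 0.9, 0.3, 0.6, 0.4]
human = [0.95, 0.75, 0.35, 0.25, 0.85]

w, r = calibrate(metric_a, metric_b, human)
```

On this toy data the search assigns most of the weight to metric A, since it agrees with the human ratings; the calibrated combination then correlates with human judgment at least as well as either metric alone.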