Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.
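The abstract describes MetaMetrics as a supervised, calibrated combination of existing metrics. As a minimal illustrative sketch (not the paper's actual optimization procedure), one can imagine fitting weights over constituent metric scores so the weighted combination correlates with human ratings; the synthetic data, least-squares weighting, and variable names below are all assumptions for illustration:

```python
import numpy as np

# Hypothetical sketch: calibrate a weighted combination of K existing metric
# scores against human ratings. The least-squares fit here is an assumed
# stand-in for whatever optimization MetaMetrics actually uses.

rng = np.random.default_rng(0)

n_examples, n_metrics = 200, 3
metric_scores = rng.normal(size=(n_examples, n_metrics))  # per-example scores from K metrics
true_weights = np.array([0.7, 0.25, 0.05])                # synthetic "ground truth" mixture
human_scores = metric_scores @ true_weights + rng.normal(scale=0.05, size=n_examples)

# Fit combination weights on (metric_scores, human_scores) pairs.
weights, *_ = np.linalg.lstsq(metric_scores, human_scores, rcond=None)
meta_score = metric_scores @ weights

# The calibrated meta-metric should track human judgments at least as well
# as the best single constituent metric.
corr_meta = np.corrcoef(meta_score, human_scores)[0, 1]
corr_single = max(np.corrcoef(metric_scores[:, k], human_scores)[0, 1]
                  for k in range(n_metrics))
print(round(corr_meta, 3), round(corr_single, 3))
```

On this toy data the combined score recovers the mixture and correlates more strongly with the simulated human ratings than any individual metric, which is the behavior the calibration step is meant to guarantee.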