Capability Assessment Papers - Zhuanzhi (专知)

Capability Assessment

SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity

Arxiv

0+ reads · June 14

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering

Arxiv

0+ reads · June 5

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

Arxiv

0+ reads · June 3

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

Arxiv

0+ reads · May 30

Student Competency Assessment and Presentation Methods Based on Algorithm Courses

Arxiv

0+ reads · May 29

Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

Arxiv

0+ reads · June 9

CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments

Arxiv

0+ reads · April 7

Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Arxiv

0+ reads · April 8

Survey of Computerized Adaptive Testing: A Machine Learning Perspective

Arxiv

0+ reads · March 15

Real-time Win Probability and Latent Player Ability via STATS X in Team Sports

Arxiv

0+ reads · February 23

Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents

Arxiv

0+ reads · February 26

Survey of Computerized Adaptive Testing: A Machine Learning Perspective

Arxiv

0+ reads · March 9

Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

Arxiv

0+ reads · March 9

Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

Arxiv

0+ reads · March 11

Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts

Arxiv

0+ reads · February 22
