LLM Benchmarks: Leaderboard

https://chat.lmsys.org/?leaderboard

This leaderboard is based on the following three benchmarks:

  • Chatbot Arena - a crowdsourced, randomized battle platform. We use 40K+ user votes to compute Elo ratings (a minimal rating-update sketch follows this list).
  • MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
  • MMLU (5-shot) - a test to measure a model’s multitask accuracy on 57 tasks.
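
For intuition, here is a minimal sketch of online Elo updating over pairwise battle outcomes. The constants (K = 4, initial rating 1000, logistic scale 400) are common defaults assumed for illustration, not necessarily the leaderboard's exact settings, and the model names are placeholders.

```python
def update_elo(ratings, model_a, model_b, winner, k=4, scale=400, init=1000):
    """Apply one battle outcome to a dict of Elo ratings (online update)."""
    ra, rb = ratings.get(model_a, init), ratings.get(model_b, init)
    # Expected score of model_a under the Elo logistic model.
    ea = 1 / (1 + 10 ** ((rb - ra) / scale))
    # Actual score: 1 for a win, 0 for a loss, 0.5 for a tie.
    sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + k * (sa - ea)
    ratings[model_b] = rb + k * ((1 - sa) - (1 - ea))

# Placeholder battle log: (model_a, model_b, winner).
battles = [
    ("model-x", "model-y", "model_a"),
    ("model-x", "model-z", "model_b"),
    ("model-y", "model-z", "tie"),
]
ratings = {}
for a, b, w in battles:
    update_elo(ratings, a, b, w)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```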

:computer: We use fastchat.llm_judge to compute MT-Bench scores (single-answer grading on a scale of 10). The Arena Elo ratings are computed by this notebook. The MMLU scores are computed by InstructEval and Chain-of-Thought Hub. Higher values are better for all benchmarks. An empty cell means the score is not available.
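
To illustrate the shape of single-answer grading, the sketch below prompts a judge model to rate one answer from 1 to 10 and parses the score from its verdict. The prompt wording and the [[score]] verdict format are assumptions in the spirit of MT-Bench, not fastchat.llm_judge's exact templates, and `judge` stands in for any callable that wraps a GPT-4 request.

```python
import re
import statistics

# Hypothetical judge prompt: ask for a 1-10 rating wrapped in [[ ]]
# so the score can be parsed reliably from free-form judge output.
JUDGE_TEMPLATE = """Please act as an impartial judge and rate the quality of the
assistant's answer to the question below on a scale of 1 to 10.
After your explanation, output the rating strictly as: Rating: [[score]]

[Question]
{question}

[Assistant's Answer]
{answer}"""

def grade_answer(question: str, answer: str, judge) -> float | None:
    """Grade one answer with a judge model (any callable str -> str);
    return the parsed 1-10 score, or None if no score is found."""
    verdict = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None

def benchmark_score(records, judge) -> float:
    """Average the judge's per-answer scores over (question, answer) pairs."""
    scores = [s for q, a in records
              if (s := grade_answer(q, a, judge)) is not None]
    return statistics.mean(scores)
```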