https://chat.lmsys.org/?leaderboard
This leaderboard is based on the following three benchmarks.
- Chatbot Arena - a crowdsourced, randomized battle platform. We use 40K+ user votes to compute Elo ratings.
- MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
- MMLU (5-shot) - a test to measure a model’s multitask accuracy on 57 tasks.
We use fastchat.llm_judge to compute MT-Bench scores (single-answer grading on a scale of 10). The Arena Elo ratings are computed by this notebook. The MMLU scores are computed by InstructEval and Chain-of-Thought Hub. Higher values are better for all benchmarks. An empty cell means the score is not available.
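As a rough illustration of how pairwise battle votes turn into Elo ratings, the following is a minimal sketch of the standard online Elo update rule, not the notebook's actual implementation; the parameter values (`k`, `scale`, `base`, `init`) and the `battles` input format are assumptions for this example.

```python
def elo_ratings(battles, k=4, scale=400, base=10, init=1000):
    """Online Elo updates from pairwise outcomes.

    battles: iterable of (model_a, model_b, winner),
             where winner is 'a', 'b', or 'tie'.
    Parameter values here are illustrative assumptions.
    """
    ratings = {}
    for a, b, winner in battles:
        ra = ratings.setdefault(a, init)
        rb = ratings.setdefault(b, init)
        # Expected score for model a under the logistic Elo model.
        ea = 1 / (1 + base ** ((rb - ra) / scale))
        sa = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Winner gains rating; loser loses the same amount.
        ratings[a] = ra + k * (sa - ea)
        ratings[b] = rb + k * ((1 - sa) - (1 - ea))
    return ratings
```

For example, a model that wins every battle against another ends up with the higher rating, while a series of ties leaves both ratings equal.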