LLM Benchmarks: Leaderboard

https://chat.lmsys.org/?leaderboard

This leaderboard is based on the following three benchmarks:

  • Chatbot Arena - a crowdsourced, randomized battle platform. We use 40K+ user votes to compute Elo ratings (a minimal rating-update sketch follows this list).
  • MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
  • MMLU (5-shot) - a test to measure a model’s multitask accuracy on 57 tasks.
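
For intuition, here is a minimal sketch of online Elo updating over pairwise battle outcomes. The constants (K = 4, initial rating 1000, logistic scale 400) are common defaults assumed for illustration, not necessarily the leaderboard's exact settings, and the model names are placeholders.

```python
def update_elo(ratings, model_a, model_b, winner, k=4, scale=400, init=1000):
    """Apply one battle outcome to a dict of Elo ratings (online update)."""
    ra, rb = ratings.get(model_a, init), ratings.get(model_b, init)
    # Expected score of model_a under the Elo logistic model.
    ea = 1 / (1 + 10 ** ((rb - ra) / scale))
    # Actual score: 1 for a win, 0 for a loss, 0.5 for a tie.
    sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + k * (sa - ea)
    ratings[model_b] = rb + k * ((1 - sa) - (1 - ea))

# Placeholder battle log: (model_a, model_b, winner).
battles = [
    ("model-x", "model-y", "model_a"),
    ("model-x", "model-z", "model_b"),
    ("model-y", "model-z", "tie"),
]
ratings = {}
for a, b, w in battles:
    update_elo(ratings, a, b, w)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```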

:computer: We use fastchat.llm_judge to compute MT-Bench scores (single-answer grading on a scale of 10). The Arena Elo ratings are computed by this notebook. The MMLU scores are computed by InstructEval and Chain-of-Thought Hub. Higher values are better for all benchmarks. An empty cell means the score is not available.
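
To illustrate the shape of single-answer grading, the sketch below prompts a judge model to rate one answer from 1 to 10 and parses the score from its verdict. The prompt wording and the [[score]] verdict format are assumptions in the spirit of MT-Bench, not fastchat.llm_judge's exact templates, and `judge` stands in for any callable that wraps a GPT-4 request.

```python
import re
import statistics

# Hypothetical judge prompt: ask for a 1-10 rating wrapped in [[ ]]
# so the score can be parsed reliably from free-form judge output.
JUDGE_TEMPLATE = """Please act as an impartial judge and rate the quality of the
assistant's answer to the question below on a scale of 1 to 10.
After your explanation, output the rating strictly as: Rating: [[score]]

[Question]
{question}

[Assistant's Answer]
{answer}"""

def grade_answer(question: str, answer: str, judge) -> float | None:
    """Grade one answer with a judge model (any callable str -> str);
    return the parsed 1-10 score, or None if no score is found."""
    verdict = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None

def benchmark_score(records, judge) -> float:
    """Average the judge's per-answer scores over (question, answer) pairs."""
    scores = [s for q, a in records
              if (s := grade_answer(q, a, judge)) is not None]
    return statistics.mean(scores)
```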