@endolith
Last active September 20, 2024 00:21
I tried to plot AGI on the Chatbot Arena Elo scale by comparing to "both bad" and "tie" votes
*.json
*checkpoint.ipynb

I tried to plot AGI on the same Elo scale by comparing to "both bad" and "tie" votes

(Or, rather, I had an LLM write it for me. (But another LLM checked it and said it was correct, so...))

When a battle is voted as a tie, a hypothetical "ideal model" is treated as having tied with both contestants. When a battle is voted as "both bad", the ideal model is treated as having beaten both. Its rating therefore acts as an upper bound for Elo scores, and since the judgments come from humans, a model that scored that well all the time would be human-equivalent?
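To make the rule concrete, here is a minimal sketch of how the virtual games could be generated and scored. It is not the notebook's actual code. The vote labels (`"model_a"`, `"model_b"`, `"tie"`, `"tie (bothbad)"`), the name `IDEAL`, the function names, and the online Elo constants (K = 4, scale 400, starting rating 1000) are all assumptions chosen for illustration. Counting a "both bad" vote as a tie between the two real models is also an assumption.

```python
from collections import defaultdict

IDEAL = "ideal"  # hypothetical name for the virtual "ideal model"


def augment_with_ideal(battles):
    """Turn (model_a, model_b, winner) votes into (a, b, score_a) games,
    adding virtual games against the ideal model for tie / both-bad votes."""
    games = []
    for a, b, winner in battles:
        if winner == "model_a":
            games.append((a, b, 1.0))
        elif winner == "model_b":
            games.append((a, b, 0.0))
        elif winner == "tie":
            games.append((a, b, 0.5))
            # A tie vote: the ideal model ties with both contestants.
            games.append((IDEAL, a, 0.5))
            games.append((IDEAL, b, 0.5))
        elif winner == "tie (bothbad)":
            # Assumption: the two real models still tie with each other.
            games.append((a, b, 0.5))
            # A "both bad" vote: the ideal model beats both contestants.
            games.append((IDEAL, a, 1.0))
            games.append((IDEAL, b, 1.0))
    return games


def compute_elo(games, k=4.0, scale=400.0, base=10.0, init=1000.0):
    """Plain online Elo over the games, processed in order."""
    ratings = defaultdict(lambda: init)
    for a, b, score_a in games:
        ra, rb = ratings[a], ratings[b]
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta  # zero-sum update
    return dict(ratings)


battles = [
    ("x", "y", "tie (bothbad)"),
    ("x", "y", "tie"),
    ("x", "y", "model_a"),
]
games = augment_with_ideal(battles)
ratings = compute_elo(games)
```

In this toy run the ideal model never loses a game, so it ends up rated above both real models, which is the upper-bound behavior described above. Decisive votes (`"model_a"` / `"model_b"`) add no virtual games here, since the rule above only covers tie and "both bad" votes.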

https://gist.github.com/endolith/e001d8b7811699cf9be822a774e7cb67

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/discussions/67

A scatter plot showing Elo rating estimates for various large language models from the LMSYS Chatbot Arena. The x-axis lists the models, such as GPT-4, LLaMA, and others, with their names displayed vertically. The y-axis represents the Elo rating. Each point represents a model's median Elo rating with error bars indicating the 95% confidence intervals. The highest-rated models are on the left, and the ratings decrease as you move to the right.


endolith commented Sep 20, 2024

A scatter plot showing Elo rating estimates for various large language models from the LMSYS Chatbot Arena. The x-axis lists the models, such as GPT-4, LLaMA, and others, with their names displayed vertically. The y-axis represents the Elo rating. Each point represents a model's median Elo rating with error bars indicating the 95% confidence intervals. The highest-rated models are on the left, and the ratings decrease as you move to the right.
