The decays are just more capable models entering the population, making all prior models lose more frequently.
If a more capable model shows up and starts beating all the other models.
There is an instance of this in the chart: on 2025-06-24, when Gemini-2.5-Pro shows up. As you can see, the Elo of the others does not drop.

That sounded like a press bulletin, so, to give you a chance to clarify: does that mean you may switch to lightly quantized models?
I am willing to bet large amounts of money that OpenAI would never release a model served fully in BF16 in the year of our lord 2026. That would be operationally insane. They're almost certainly doing QAT down to FP4 for the FFN weights, and a similar or slightly larger quant for the attention tensors.
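For illustration, this is roughly what round-to-nearest 4-bit weight quantization looks like. It's a toy sketch with signed integers and a per-row scale; real FP4 formats (e.g. E2M1) and QAT pipelines are considerably more involved:

```python
import numpy as np

def quantize_int4(w, axis=-1):
    """Symmetric per-channel quantization of weights to a 4-bit range.

    Toy sketch only: round-to-nearest into [-7, 7] with one scale per
    channel, not an actual FP4 format or a QAT training loop.
    """
    qmax = 7  # symmetric signed 4-bit range [-7, 7]
    scale = np.max(np.abs(w), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half a scale step
```

The point of QAT is that the model is trained with this rounding in the loop, so the weights adapt to the quantization error instead of merely tolerating it.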
I know this might be a question that's impossible for you to answer, but if you can shed any light on this matter, I'd be grateful, as I am doing an analysis of which AI solutions might be suitable for my organisation.
APIs have much smaller ones
Novita's has occasional problems counting whitespace. The DeepSeek-hosted one does not.
No idea why.
You can't use Elo scores to measure the decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
Relative ranking systems extract more information per tournament, and given enough tournaments you will get something approximating each model's actual latent skill level.
New models are on average better than older models, so the average skill of the population of models increases over time. You are therefore mathematically guaranteed that any existing model will degrade in Elo score over time, even though the model itself didn't change in any way.
It's like benchmarking a model against a list of challenges that over time are made more and more difficult and then claiming the model got nerfed because its score declined.
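You can see this dynamic in a toy simulation: one frozen model of fixed latent skill, with a stronger entrant joining every generation, standard Elo updates, and outcomes drawn from the latent skills. The K-factor, skill increments, and match counts below are arbitrary assumptions:

```python
import random

def elo_expected(ra, rb):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def update(ra, rb, score_a, k=16):
    """One Elo update; score_a is 1.0 for an A win, 0.0 for a loss."""
    ea = elo_expected(ra, rb)
    return ra + k * (score_a - ea), rb + k * ((1 - score_a) - (1 - ea))

random.seed(0)
skills = {"frozen": 1500.0}   # latent skill: the frozen model never changes
ratings = {"frozen": 1500.0}  # Elo rating as estimated from matches
history = []
for gen in range(5):
    # each generation, a genuinely stronger model enters the population
    name = f"new_{gen}"
    skills[name] = 1500.0 + 100 * (gen + 1)
    ratings[name] = 1500.0
    for _ in range(4000):  # random pairwise matches within the population
        a, b = random.sample(list(ratings), 2)
        p_a = elo_expected(skills[a], skills[b])  # outcome from latent skill
        s = 1.0 if random.random() < p_a else 0.0
        ratings[a], ratings[b] = update(ratings[a], ratings[b], s)
    history.append(ratings["frozen"])
# history: the frozen model's rating drifts down generation by generation
```

The frozen model's latent skill never moves, yet its Elo falls every generation simply because the field around it got stronger.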
Elo is good at establishing an overall ranking order across models but that's not what this is about.
To detect nerfing of a model, projects like https://marginlab.ai/trackers/claude-code/ are much much better (I'm not affiliated in any way).
I hope to see the other labs bring back competition soon!
Thank you, I just looked at the chart and said to myself: ELO? YOLO!
That Elo rating is the same system used for chess rankings.
Honestly, in my opinion, GPT-5.5 Codex doesn't just crush Claude Code 4.7 Opus; it's writing code at a level so advanced that I sometimes struggle to even fully comprehend it. Even when navigating fairly massive codebases spanning four different languages and regions (US, China, Korea, and Japan), Codex's performance is simply overwhelming.
How would we even go about properly measuring and benchmarking the Elo for autonomous agents like this?
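One plausible approach (a sketch, not an established methodology): collect blinded head-to-head outcomes on identical tasks and fit a Bradley-Terry model, then map the fitted strengths onto an Elo-like scale. The model names and match counts below are made up for illustration:

```python
import math

# hypothetical head-to-head results: (winner, loser) from blinded comparisons
matches = ([("codex", "claude")] * 60 + [("claude", "codex")] * 40 +
           [("codex", "other")] * 70 + [("other", "codex")] * 30 +
           [("claude", "other")] * 55 + [("other", "claude")] * 45)

models = sorted({m for pair in matches for m in pair})
strength = {m: 1.0 for m in models}

# Bradley-Terry fit via simple minorization-maximization iterations
for _ in range(200):
    wins = {m: 0 for m in models}
    denom = {m: 0.0 for m in models}
    for w, l in matches:
        wins[w] += 1
        d = 1.0 / (strength[w] + strength[l])
        denom[w] += d  # each match contributes to both players' denominators
        denom[l] += d
    strength = {m: wins[m] / denom[m] for m in models}
    norm = sum(strength.values())  # renormalize to pin down the scale
    strength = {m: s * len(models) / norm for m, s in strength.items()}

# convert fitted strengths to an Elo-like scale
elo = {m: 1500 + 400 * math.log10(strength[m]) for m in models}
```

The hard part for agents isn't the math, it's the comparisons themselves: tasks must be identical, judging must be blinded, and "won" needs an agreed definition.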
Most of the code I've worked with comes from Korean and Chinese startups, industrial contractors, or older corporate research-lab environments. So I know my frame of reference is limited.
When I write code, I usually rely on fairly conservative patterns: Result-style error handling instead of throwing exceptions through business logic, aggressive use of guard clauses, small policy/strategy objects, and adapters at I/O boundaries. I also prefer placing a normalization layer before analysis and building pure transformation pipelines wherever possible.
So when Codex produced a design that decoupled the messy input adapter from the stable normalized data, and then separated that from the analyzer, it wasn't just 'fancier code.' It aligned perfectly with the architectural direction I already value, but it pushed the boundaries of that design further than I would have initially done myself.
This is exactly why I hesitate to dismiss code as 'bad' just because I don't immediately understand it. Sometimes, it really is just bad code. But sometimes, the abstraction is simply a bit ahead of my current local mental model, and I only grasp its true value after a second or third requirement is introduced.
To be completely honest, using AI has caused a significant drop in my programming confidence. Since AI is ultimately trained on codebases written by top-tier programmers, its output essentially represents the average of those top developers, or perhaps slightly below their absolute peak.
I often find myself realizing that the code I write by hand simply cannot beat it.