What if you give Opus the same harness? Do people even care about meaningful comparisons any more, or is it all just “numbers go up”?
This is the state of "AI" these days I guess...
[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
According to the authors, the harness isn't ARC-AGI specific, though: https://x.com/agenticasdk/status/2037335806264971461
The hard part of these tests isn't purely reasoning ability ffs.
This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.
That said, their harness isn't generic. It includes a ridiculously detailed prompt for how to play this specific game. Forbidding tool use is arbitrary and, above all, pointless hoop jumping, but that doesn't make the linked "achievement" any less fraudulent.
EDIT from https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf:
> We seek to fight two forms of overfitting that would muddy public sensefinding:
> Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
Not sure if the specific rules of this prize allow that, but I would accept that
Anyway, searching both in ARC-AGI's paper and website and directly on Kaggle, I failed to find a with-harness leaderboard; can you please give the link?
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
I wish we'd move past public test sets for LLM benchmarks: publish a plain English explanation of the tasks, allow questions and clarifications, but never release a single question from the test set verbatim.
It made sense back when models needed to be finetuned on the task to even reliably answer. If we're saying this is the path to AGI we should be able to rely on the generalization of the model to get it right.
If a model was trained on <|begin_text|> <|end_text|> and you change the tokens passed to <|start_text|> <|end_text|>, it loses several 'IQ points' if it can even answer back at all anymore.
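To make the delimiter point concrete, here is a toy sketch (not any real model's tokenizer). It assumes a hypothetical tokenizer that only treats the delimiters it was built with as single tokens and falls back to one token per character for everything else. The names `KNOWN_SPECIAL` and `toy_tokenize` are made up for illustration.

```python
import re

# Hypothetical special tokens the toy tokenizer was "trained" with.
KNOWN_SPECIAL = ["<|begin_text|>", "<|end_text|>"]


def toy_tokenize(text, special=KNOWN_SPECIAL):
    """Split text into tokens: known special tokens stay whole,
    everything else is broken into single characters."""
    pattern = "(" + "|".join(re.escape(s) for s in special) + ")"
    tokens = []
    for part in re.split(pattern, text):
        if not part:
            continue  # re.split yields empty strings at the edges
        if part in special:
            tokens.append(part)  # recognized delimiter: one token
        else:
            tokens.extend(part)  # fallback: one token per character
    return tokens


seen = toy_tokenize("<|begin_text|>hi<|end_text|>")
unseen = toy_tokenize("<|start_text|>hi<|end_text|>")
print(len(seen), len(unseen))
```

With the delimiter it knows, the toy tokenizer emits one token for the opening marker. With the renamed one, the marker dissolves into a run of character tokens, so the model would be looking at a sequence shape it never saw in training. Real tokenizers differ in the details, but the mismatch is the same kind of thing.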
Synthetic data is fine. Synthetic data on very similar questions generated based on the description is typically fine. But once the shape of what you're training on gets too close to the actual holdout questions, you're getting an uplift that's not realistic for unseen tasks.
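One rough way to put a number on "too close to the holdout" is word n-gram overlap between a synthetic training item and a holdout question. This is only a sketch of that common heuristic, not how any particular benchmark checks contamination; `ngrams`, `overlap_ratio`, and the choice of n=3 are assumptions for illustration.

```python
def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate, holdout, n=3):
    """Fraction of the candidate's n-grams that also occur in the holdout.

    Returns 0.0 when the candidate is too short to contain any n-gram.
    """
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(holdout, n)) / len(cand)


holdout_q = "rotate the grid by ninety degrees then mirror it"
synthetic_q = "rotate the grid by ninety degrees and recolor it"
print(overlap_ratio(synthetic_q, holdout_q))
```

A ratio near 1.0 means the synthetic item is nearly a copy of the holdout question; a ratio near 0.0 only shows the wording differs, not that the task shape does, so this catches verbatim leakage rather than the subtler structural closeness described above.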
1. https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
Here is the ARC-AGI-3 specific harness by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...