Fresh Hacker News | $500 GPU outperforms Claude Sonnet on coding benchmarks

▲$500 GPU outperforms Claude Sonnet on coding benchmarks(github.com)

257 points by yogthos 16 hours ago | 27 comments

▲bloppe 3 hours ago

Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.

▲sigmoid10 1 hour ago

Probably want to look at SWE-bench Pro or Terminal-Bench 2. They cover these longer horizon tasks that need more than just writing a bit of code in one file. SWE-bench Pro in particular is not yet saturated like many other common benchmarks. Normal SWE-bench and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo readme or press release.

▲Bombthecat 30 minutes ago

Oh yes! I let my environments now be built by agents via kubectl / helm and let them debug issues.

It's amazing! Saves hours of work!

I create the basic helm config, settings, etc., and when there is a conflict or something isn't working I let an agent fix it!

▲mmaunder 8 hours ago

I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence. The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.

▲thefourthchime 6 hours ago

I won’t use anything less than the SOTA. I tried using Opus 4.6 medium and immediately regretted it. High messes up enough.

▲overfeed 4 hours ago

What were you using 6 months ago?

▲withinboredom 3 hours ago

Opus 4.5 ~= Opus 4.6 high. Opus 4.5 was nerfed just before or after the release of 4.6.

▲hhh 1 hour ago

The models don’t change.

▲tornikeo 1 hour ago

On paper. There's huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscriptions.

▲esskay 52 minutes ago

Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as such ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.

▲fer 34 minutes ago

They do. I'm currently seeing a degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".

▲stavros 1 minute ago

Make that 2, I told my friends yesterday "Opus got dumb, new model must be coming".

▲girvo 17 minutes ago

I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.

▲pixel_popping 40 minutes ago

Oh yes, they do.

▲rf15 3 hours ago

You cannot afford the SOTA.

▲weird-eye-issue 3 hours ago

Why is that? The $200 per month subscription comes with a ton of usage.

Opus 4.6 is available on the $20 plan too

▲Epholys 45 minutes ago

> The $200 per month subscription comes with a ton of usage.

$200 dollars + VAT is half of my rent.

I know HN is not a good place to rant on this subject, but I'm often flabbergasted by the number of people here who live in a bubble with regard to the price of tech. Or just prices in general.

I remember someone who said a few years ago (I'm paraphrasing): "You could just use one of the empty rooms in your house!". It was so outlandish I believed it was a joke at first.

EDIT: "not", minor grammar

▲Bombthecat 29 minutes ago

That's why ai is for the "rich". Poor people or later on middle class will be left behind....

▲walletdrainer 22 minutes ago

> I'm often flabbergasted by the number of people here who live in a bubble with regard to the price of tech

Sorry, no. You live in the bubble, the people you think are living in a bubble are actually doing the very opposite and taking advantage of the lack of bubbles in our globally connected world.

Today, basically anyone can sell any bullshit to billions of people around the world. We’ve never lived in less of a bubble.

▲stavros 2 minutes ago

I guess all those people who live in not-SF just can't be bothered to succeed!

▲m4rtink 19 minutes ago

A subscription for coding - no thanks.

▲aleph_minus_one 1 hour ago

> The $200 per month subscription comes with a ton of usage.

200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.

▲weird-eye-issue 56 minutes ago

"Opus 4.6 is available on the $20 plan too"

▲revolvingthrow 45 minutes ago

Anthropic’s $20 plan gives you such a pittance of tokens that it’s borderline unusable for anything more than a few scripts or a toy app. If $20 is all you have you’d do _much_ better going with chatgpt

▲cpursley 23 minutes ago

Are you kidding me? Even developer salaries in the Philippines can afford that or at least the plan below it. If I used the Anthropic API, my monthly spend would be $4k a month. The Claude Max plan is the best bargain around.

▲LoganDark 26 minutes ago

> 200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.

Not true, I live in USA PNW and my last remote job paid $12k/mo. I have been jobless for over a month now (currently waiting for the next HN "who wants to be hired"), but I still have enough savings to easily afford to continue that plan for a while.

I don't think it really has to do with affluence but more the job market and economy you're in. Countries with lower salaries or higher costs of living will have less buying power.

▲komali2 3 hours ago

I'm starting to think in these conversations we're all often talking about two different things. You're talking about running an LLM service through its provided tooling (codex, Claude, cursor), others seem to be talking token costs because they're integrating LLMs into software or are using harness systems like opencode, pi, or openclaw and balancing tasks across models.

▲weird-eye-issue 2 hours ago

Fair enough, I read it quickly and assumed the person they replied to was talking about Claude Code

But I run an AI SaaS and we do offer Opus 4.6, too. Our use case is not nearly as token intensive as something like coding so we are still able to offer it with a good profit margin.

Also you can run OpenClaw with your CC subscription. It's what I do.

▲BoorishBears 1 hour ago

I wrap Opus 4.5 in a consumer product with 0 economic utility and people pay for it, I'm sure plenty of end users are willing to pay for it in their software.

Edit: I'm not using the term of art, I mean it literally cannot make them money.

▲eru 1 hour ago

> [...] in a consumer product with 0 economic utility and people pay for it, [...]

Sorry, how do these two things go together?

If people pay for it, it has economic utility, doesn't it? I mean, people pay to watch movies or play video games, too.

▲XCSme 8 hours ago

Yup, they do quite poorly on random non-coding tasks:

https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...

▲rmi_ 2 hours ago

Wild benchmark. Opus 4.6 is ranked #29, Gemini 3 Flash is #1, in front of Pro.

I'm not saying it's bad, but it's definitely different than the others.

▲XCSme 1 hour ago

The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.

▲BoorishBears 1 hour ago

> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.

Yuck. At that point don't publish a benchmark, explains why their results are useless too.

Edit since I'm not able to reply to the below comment:

"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.

I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.

▲XCSme 1 hour ago

Why not? I described this in more detail in other comments.

Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external api's, parsing documents, etc.

Most models get this right. Also, this is just one failure mode of Claude.
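
The disagreement above comes down to how strictly a benchmark grades output format. As a rough illustration only (this is not the aibenchy grader; the function names and the normalization rule are assumptions made for this sketch), here is how the quoted `*1*, *1*` answer fares under a strict comparison versus one that strips markdown emphasis first:

```python
import re


def strict_match(output: str, expected: str) -> bool:
    """Exact comparison after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def strip_markdown_emphasis(text: str) -> str:
    """Crude normalizer: drop '*', '_' and backtick characters.

    Hypothetical rule for this sketch; it would also mangle answers
    that legitimately contain those characters.
    """
    return re.sub(r"[*_`]+", "", text)


def lenient_match(output: str, expected: str) -> bool:
    """Compare after removing markdown emphasis from the model output."""
    return strict_match(strip_markdown_emphasis(output), expected)


raw = "*1*, *1*"      # the failure example quoted in the thread
expected = "1, 1"     # assumed required format
print(strict_match(raw, expected))   # False
print(lenient_match(raw, expected))  # True
```

Which of the two graders is "right" is exactly what the two commenters disagree on: the strict one measures instruction following, the lenient one measures whether the underlying answer is correct.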

▲usagisushi 5 hours ago

Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.

Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)

▲raincole 23 minutes ago

I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.

▲scotty79 13 minutes ago

GLM 5 here is significantly better than GPT-5.4

▲wizee 6 hours ago

It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.

▲XCSme 6 hours ago

I used qwen 3.5 plus in production, it was really good at instruction following and tool calling.

▲comboy 1 hour ago

Not really related, but does anybody know if somebody's tracking the same model's performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.

▲XCSme 1 hour ago

Oh, I didn't think about this, that's a good idea. I also feel generally model performance changes over time (usually it gets worse).

The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.
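
One cheap way to approach this, sketched under the assumption that you re-run a small fixed task set on a schedule and record the pass rate each time (the function names and the 5-point tolerance are made up for illustration), is to compare the latest run against the mean of earlier runs:

```python
from statistics import mean


def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed in one benchmark run."""
    return sum(results) / len(results)


def detect_drift(history: list[float], latest: float, tolerance: float = 0.05) -> bool:
    """Flag a drop when `latest` is more than `tolerance` below the mean of past runs.

    `tolerance` is an arbitrary threshold for this sketch; with a small task
    set, run-to-run noise can easily exceed it.
    """
    if not history:
        return False
    return mean(history) - latest > tolerance


# Hypothetical weekly pass rates for one model on the same fixed task set.
history = [0.80, 0.82, 0.78]
print(detect_drift(history, latest=0.70))  # True
print(detect_drift(history, latest=0.78))  # False
```

Keeping the task set small and fixed bounds the cost per run, at the price of noisier numbers, so a single flagged run is weak evidence on its own.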

▲comboy 20 minutes ago

Yeah, good tests are associated with cost. I'd like to see benchmarks on big messy codebases and how models perform on a clearly defined task that's easy to verify.

I was thinking that tokens spent in such case could also be an interesting measure, but some agent can do small useful refactoring. Although prompt could specify to do the minimal change required to achieve the goal.

▲miroljub 1 hour ago

> I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence.

I use MiniMax daily, mostly for coding tasks, using pi-coding-agent mostly.

> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.

I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.

> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.

Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.

What I notice is, while Opus and Sonnet feel better for synthetic benchmarks, it doesn't matter in the real world. I never put so much effort into coming up with a perfect problem spec like the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. And that's exactly what all those benchmarks are doing. And that's where Anthropic tools shine in comparison to cheaper Chinese models.

When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.

Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and model switch is sometimes required to get a fresh perspective on the problem.

The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.

▲tim-projects 58 minutes ago

I've only been using free tokens for a year now. I was on Gemini, and they just dropped Pro, so I switched to MiniMax. Bit of a hurdle switching from Gemini-cli to kilo-cli, but now I can't really see too much difference.

If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.

I've not ever used Claude and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.

When I hit issues with these lower models I think hard about creating the right tooling, agnostic to the harness. I feel like maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days so why change what clearly works?

▲victorbjorklund 1 hour ago

yea, they are still useful. But yea not close to Claude or GPT. But works good for simple changes. I use a combo of minimax and codex

▲moffkalast 1 hour ago

Kimi's been one of my goto options lately and it oftentimes outperforms both Claude and GPT in debugging, finding the actual problem immediately while the other two flail around drunkenly.

It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.

▲smokel 1 hour ago

And what tooling do you use with that? In my experience, there is quite a bit of difference between using, say, OpenCode, or the commercial offerings.

▲moffkalast 1 hour ago

No tooling, just manual use. When doing these comparisons I gather and format all the data they need to figure out the problem, and paste the same thing into all models so it's a pretty even eval.

I doubt Kimi would do well with most harnesses, its outputs are pretty chaotic in terms of formatting but the intelligence is definitely there.

▲m00x 3 hours ago

Minimax 2.7 is fine for most web stuff. It's slightly worse than Claude at backend, but works great for frontend.

They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

▲Leynos 2 hours ago

Kimi is surprisingly good at Rust.

▲dvt 3 hours ago

> They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.

▲stuaxo 37 minutes ago

10x more code output is 10x more review.

We've gone from doing the first 90% and then the second 90% to the first 90% and the second 990%, it's exhausting.

▲mkw2000 4 hours ago

i find kimi to be very very good, minimax not so much

▲paulddraper 4 hours ago

Agreed.

They are equivalent of frontier models 8+ months ago.

▲selcuka 9 hours ago

It's a race to the bottom. DeepSeek beats all others (single-shot), and it is ~50% cheaper than the cost of local electricity only.

> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot

> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline

▲sourcecodeplz 6 hours ago

I've tested many open models; DeepSeek 3.2 is the only one similar to SOTA.

▲yogthos 8 hours ago

You could use this approach with DeepSeek as well. The innovation here is that you can generate a bunch of solutions, use a small model to pick promising candidates and then test them. Then you feed errors back to the generator model and iterate. In a way, it's sort of like a genetic algorithm that converges on a solution.
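
A minimal sketch of the test-and-feed-errors-back part of that loop, leaving out the candidate-selection step. The generator here is a stub standing in for a model call, and `run_tests`, `repair_loop`, and `fake_generator` are names invented for this example, not anything from the ATLAS repo:

```python
from typing import Callable, Optional


def run_tests(code: str, tests: list[tuple[int, int]]) -> Optional[str]:
    """Run candidate code that defines `solve`; return an error message or None.

    Uses exec() for brevity. Real model output is untrusted and should run
    in an isolated sandbox instead.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for arg, want in tests:
            got = namespace["solve"](arg)
            if got != want:
                return f"solve({arg}) returned {got}, expected {want}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(
    generate: Callable[[Optional[str]], str],
    tests: list[tuple[int, int]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate, test, and feed the failure message into the next attempt."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = generate(feedback)
        feedback = run_tests(code, tests)
        if feedback is None:
            return code  # all tests passed
    return None  # gave up after max_rounds


def fake_generator(feedback: Optional[str]) -> str:
    """Stub for a model: wrong on the first try, right once it sees an error."""
    if feedback is None:
        return "def solve(x):\n    return x + x"
    return "def solve(x):\n    return x * x"


print(repair_loop(fake_generator, tests=[(2, 4), (3, 9)]))
```

The first attempt passes `solve(2)` by accident and fails `solve(3)`; the failure message becomes the feedback for the second attempt, which passes.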

▲eru 1 hour ago

Why do you need a small model to pick promising candidates? Why not a bigger one?

(And ideally you'd probably test first, or at least try to feed compiler errors back etc?)

Overall, I mostly agree.

▲hu3 6 hours ago

Indeed but:

1) That is relatively very slow.

2) Can also be done, simpler even, with SoTA models over API.

▲yogthos 6 hours ago

Right, this works with any models. To me, the most interesting part is that you can use a smaller model that you could run locally to get results comparable to SoTA models. Ultimately, I'd far prefer running local, even if slower, for the simple reason of having sovereignty over my data.

Being reliant on a service means you have to share whatever you're working on with the service, and the service provider decides what you can do, and make changes to their terms of service on a whim.

If locally running models can get to the point where they can be used as a daily driver, that solves the problem.

▲mikestorrent 8 hours ago

> cheaper than the cost of local electricity only.

Can you explain what that means?

▲simonw 8 hours ago

I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.

Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.

▲BoredomIsFun 40 minutes ago

> Local model enthusiasts often assume that running locally is more energy efficient than running in a data center,

It is a well known 101 truism in /r/Localllama that local is rarely cheaper, unless run batched - then it is massively, 10x cheaper indeed.

> I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.

Because it is hosted in China, where energy is cheap. In the ex-USSR, where I live, it is inexpensive too, and given that I had to run a small space heater all winter because my central heating is inadequate, running locally came out as 100% free.

▲jacquesm 6 hours ago

Some of those local model enthusiasts can actually afford solar panels.

▲jLaForest 5 hours ago

You are still incurring a cost if you use the electricity instead of selling it back to the grid

▲Kodiack 5 hours ago

The extent of that heavily depends on where you are. Where I live in NZ, the grid export rates are very low while the import rates are very high.

Our peak import rate is 3x higher than our solar export rate. In other words, we’d need to sell 3 kWh of energy to offset the cost of using 1 kWh at peak.

We’re currently in the process of accepting a quote for home batteries. The rates here highly incentivise maximising self-use.
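
The arithmetic behind the 3:1 figure, as a small sketch. Only the 3x ratio comes from the comment; the cent-per-kWh rates below are hypothetical placeholders chosen to match that ratio:

```python
def export_kwh_to_offset(import_kwh: float, import_rate: float, export_rate: float) -> float:
    """kWh that must be exported to earn back the cost of `import_kwh` imported.

    Rates can be in any currency unit per kWh as long as both use the same one.
    """
    return import_kwh * import_rate / export_rate


# Hypothetical rates in cents per kWh, chosen so that import is 3x export.
print(export_kwh_to_offset(import_kwh=1, import_rate=36, export_rate=12))  # 3.0
```

The same function shows why self-use is favoured: every kWh consumed directly avoids the import rate, while every kWh exported only earns the lower export rate.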

▲dmichulke 5 hours ago

Luxembourg: Purchase price = 2 x sales price, mostly due to grid costs.

And this is with no income tax or VAT on sold electricity.

▲croes 4 hours ago

Local enthusiasts don’t have to fear account banning.

▲littlestymaar 6 hours ago

I guess it mostly comes from using the model with batch-size = 1 locally, vs high batch size in a DC, since GPU power consumption doesn't grow that much with batch size.

Note that while a local chatbot user will mostly be using batch-size = 1, it's not going to be true if they are running an agentic framework, so the gap is going to narrow or even reverse.
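
To make the batching argument concrete: energy per token is power draw divided by throughput, so if power rises a little while throughput rises a lot, the energy per token drops sharply. The wattage and tokens-per-second figures below are invented for illustration, not measurements of any real GPU or server:

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token (watts are joules per second)."""
    return power_watts / tokens_per_second


# Hypothetical numbers: one request at a time vs. many concurrent requests.
single = joules_per_token(power_watts=300, tokens_per_second=50)
batched = joules_per_token(power_watts=400, tokens_per_second=1600)

print(single)            # 6.0
print(batched)           # 0.25
print(single / batched)  # 24.0
```

With these made-up inputs the batched server uses 24 times less energy per token, which is the shape of the effect described above; the real ratio depends entirely on the hardware, model, and serving stack.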

▲eru 1 hour ago

Well, different parts of the world also have different electricity prices.

▲atoav 7 hours ago

It means that the electricity you would have to pay for if you did the computations yourself would be more expensive than paying them to do it. Part of that has to do with the fact that China has cheap electricity, also due to their massive push into renewables. Part of that is just economies of scale. A big server farm can run more efficiently than your PC on average.

▲AuthAuth 6 hours ago

cheap electric due to their massive push on non renewables. There has been no change in the price of electricity during the renewable shift.

▲jojobas 8 hours ago

China has cheap electricity.

▲ericd 8 hours ago

Well, also, LLM servers get much more efficient with request queue depth >1 - tokens per second per gpu are massively higher with 100 concurrents than 1 on eg vllm.

▲DeathArrow 1 hour ago

Yes, but the hardware they use for inference like Huawei Ascend 910C is less efficient than Nvidia H100 used in US due to the difference in the process node.

▲memothon 13 hours ago

I'm always skeptical because you can make it pass the benchmarks, then you use it and it is not practically useful unlike an extremely general model.

Cool work though, really excited for the potential of slimming down models.

▲kimixa 6 hours ago

I find it's often very language and sector dependent. I still see a massive difference in systems programming (normally c++ and rust) between any open model I've tried and something like sonnet 4.5 (not really tried 4.6). And honestly, even the big models (like Opus 4.6) struggle in many cases.

Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (minimax2.5, GLM-4.7, Qwen3, 3.5 and -coder variants) spends so much time trying to get something syntactically sensible and accepted by the compiler that when it has finished it barely seems to have any "momentum" left to actually solve the problems, as pretty much anything but the most trivial change ends up in another loop of actually trying to get it working again, often losing the intent of that change in the process.

My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, memory allocation (NO DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS DAMMIT) before it even gets to the "logic".

Having plenty of local GPU power I'd love to be able to actually use that, and I'm already wary about some of the training data use and its interactions with the license of the code I'm "sending" to the cloud models...

▲yogthos 11 hours ago

You obviously have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.

But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.

ATLAS also asks the model for an embedding of what it just wrote which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.

These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.

So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
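
A toy version of the two ideas in this comment, under loud assumptions: the real Cost Field is described as a small trained neural network, whereas the `cost` function below is just a linear score with made-up weights, and the embeddings are made-up vectors. The second helper shows why retries help, assuming attempts succeed independently with the same probability (which real model samples do not strictly satisfy):

```python
from typing import Sequence


def cost(embedding: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Stand-in scorer: lower score means 'more likely correct'."""
    return sum(e * w for e, w in zip(embedding, weights)) + bias


def pick_candidate(embeddings: Sequence[Sequence[float]], weights: Sequence[float]) -> int:
    """Return the index of the candidate with the lowest score."""
    scores = [cost(e, weights) for e in embeddings]
    return min(range(len(scores)), key=scores.__getitem__)


def p_at_least_one(p_single: float, k: int) -> float:
    """Chance that at least one of k independent attempts is correct."""
    return 1 - (1 - p_single) ** k


# Hypothetical fingerprints for three generated solutions.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [2.0, -1.0]  # hypothetical "trained" weights

print(pick_candidate(embeddings, weights))  # 1
print(p_at_least_one(0.5, 3))               # 0.875
```

Only the picked candidate would then go to the (slow) sandboxed test run; if the scorer picks wrong, the pipeline falls back to testing and repairing, which is where the quoted 88% selection accuracy matters.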

▲zar1048576 10 hours ago

Really intriguing set of techniques to improve accuracy by generating multiple solutions. Even with the work to predict the most likely solutions, it's not clear to me based on the description how this could all be done efficiently. Would definitely be really impressive if it pans out on real-world use cases. Will look to kick the tires on this if I can get some time.

▲yogthos 10 hours ago

Seems like the key insight is to train a small model that acts as a heuristic for embeddings that resemble quality code. I imagine a lot depends on how well this model is trained. And you could probably create specialized versions for different languages and domains.

Another interesting approach could be to use this set up with a language like Clojure or Common Lisp which facilitates interactive development. If you could hook up the agent directly to a REPL in a running program, then it could run tests with a lot less overhead.

▲xyzzy123 8 hours ago

I'm super confused. The small model "cost field" `rag-api/geometric_lens/cost_field.py` was trained on PASS_TASKS like "Write a function that counts vowels in a string." and FAIL_TASKS like "Write a function that converts a regular expression string to an NFA using Thompson's construction, then converts the NFA to a DFA.".

So it seems like it's a difficulty classifier for task descriptions written in English.

This is then used to score embeddings of Python code, which is a completely different distribution.

Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.

But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.

▲yogthos 8 hours ago

I think the goal is to have a light heuristic that helps find plausibly useful solutions. They're still going to go through a testing phase as a next step, so this is just a very simple filter to decide what's even worth testing.

▲b3ing 5 hours ago

Will open source or local llms kill the big AI providers eventually? If so when? I can see maybe basic chat, not sure about coding and images yet

▲jillesvangurp 2 hours ago

Not necessarily kill; but it will slowly push them off the critical path. Local agents can delegate to remote sub agents as needed but should default to local processing for low cost and latency reasons.

I think the notion of a one-size-fits-all model is a bit like a sports car: just getting the biggest/fastest/best one is overkill. You use bigger models when needed, but they use a lot of resources and cost you a lot. A lot of AI work isn't solving important math or algorithm problems. Or leet coding exercises. Most AI work is mundane plumbing work, summarizing, a bit of light scripting/programming, tool calling, etc. With skills and guard rails, you actually want agents to follow those rather than get too creative. And you want them to work relatively quickly and not overthink things. Latency is important. You can actually use guard rails to decide when to escalate to bigger models and when not to.
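
One possible shape for such an escalation rule, as a sketch only. The model names `local-small` and `remote-large`, the `Task` fields, and both thresholds are hypothetical; a real router would use whatever signals its guard rails actually produce:

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    failed_attempts: int = 0  # how many times the local model already failed


def choose_model(
    task: Task,
    max_local_prompt_chars: int = 4000,
    max_local_failures: int = 2,
) -> str:
    """Default to the local model; escalate on repeated failure or oversized input."""
    if task.failed_attempts >= max_local_failures:
        return "remote-large"
    if len(task.prompt) > max_local_prompt_chars:
        return "remote-large"
    return "local-small"


print(choose_model(Task("Summarize this log line")))                     # local-small
print(choose_model(Task("Summarize this log line", failed_attempts=2)))  # remote-large
print(choose_model(Task("x" * 5000)))                                    # remote-large
```

Defaulting to the local model keeps cost and latency low for the mundane majority of requests, and the failure counter gives a simple, auditable reason for each escalation.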

▲throwaway85825 5 hours ago

Financial gravity will kill them when returns don't match stratospheric expectations.

▲bluefirebrand 4 hours ago

I hope so too, but I think it's wishful thinking. Be prepared for the mother of all financial bailouts from the world governments to make sure that doesn't happen

▲hollerith 4 hours ago

I can understand why banks got bailed out by the US gov in 2008, but why would a government feel the need to bail out AI labs?

I hope you are not going to say, "to avoid a global recession or depression caused by the popping of the AI bubble". That would be unnecessary and harmful (in its second-order effects), and governments do have advisors who are competent enough in economics to advise against such a move.

▲graemep 1 hour ago

Can you understand why banks were bailed out to the extent of protecting shareholders?

In the UK the first bank to go, Northern Rock, was simply taken over by the government. The shareholders got nothing. The bailout of Lloyds bank required the government taking a 40% stake. This is the way to go - if you need a bailout there should be a cost to the shareholders. Otherwise you are just privatising profit and nationalising risk.

Not that UK regulation was great all round or the bailout perfect. It certainly failed to prevent the crisis which could have been done (no doubt the same applies in many countries). I looked at Northern Rock's accounts some time (a year, maybe?) before the crisis and was horrified by their reliance on interbank lending. It was obvious they could not cope with a rise in rates.

▲nyargh 3 hours ago

Bold of you to assume competency will overpower politics in our current era.

▲hollerith 3 hours ago

So far, the country I know best, the US, has been competent enough to avoid massive corporate bailouts except the aforementioned banks in 2008 and GM. The bailout of GM was not motivated by a desire to avoid a recession when a bubble pops.

If the AI labs become very influential and powerful, Washington might nationalize them, but that would be very different from bailing them out because they have become unprofitable and cannot attract additional investment from the private sector.

▲Scottn1 2 hours ago

You forgot about the $9b bailout to Intel in August of 2025.

With the recent OpenAI deal with the government I am certain they would throw tons of money at OpenAI if it got real bad. But with the upcoming IPO where they are expected to be valued at $840b, we would be a LONG way from them needing a bailout. Well past this current admin.

▲nyargh 3 hours ago

Despite politics, TARP was arguably an economic success story for the US treasury despite public sentiment. Whether it created moral hazard or not I suppose is up for debate.

GM on the other hand should have been left to die.

However, I was obliquely referring to the open transactionality and patronage encouraged by the current administration, and how the AI / big tech players have, with few exceptions, gleefully joined in.

Unless they run out of money for bribes, I think it's inevitable that current government will bend over backwards to prop them up.

▲graemep 1 hour ago

Do the examples of the banks and GM suggest that it is likely that AI companies will get a bailout to avoid the bubble popping?

The reason the banks bailouts did not involve nationalisation is that the US is very reluctant to nationalise anything.

▲attila-lendvai 2 hours ago

a bailout is a popular way in which public funds lose their publicness.

▲lukan 1 hour ago

"but why would a government feel the need to bail out AI labs"

Oh easy, with all the drones and sensors, AI means military power. Those who dare oppose the bailout of the local AI giants want the other side to win.

▲freekh 3 hours ago

This has been my theory for a while: during this autumn Apple will release a version of Apple Intelligence that runs locally and works better than ChatGPT. They will do this because 1) they do not have an offering in AI yet 2) they have amazing hardware that even now almost can pull it off on open models and this will not be possible to replicate on android for a long time (presumably)

This will crush OpenAI.

Note: I am not talking about coding here. That will take a while longer, but when it is optimized to the bone and LLM output has stabilized, you will be running that on local hardware too. Cost will come down for Claude and friends too, but why pay 5 when you can have it for free?

▲oarsinsync 1 hour ago

> This has been my theory for a while: during this autumn Apple will release a version of Apple Intelligence that runs locally and works better than ChatGPT.

In this theory, can you explain why Apple has announced it’s paying Google for Gemini too?

Eventually, this may be true. This autumn? Highly unlikely.

▲qingcharles 4 hours ago

Unless there are some really, really major shortcuts found in inference, then it's always going to be hard to run a really great model locally. The costs of the PC + electric will usually be crazy compared to a $20/mo Claude sub.

▲3836293648 1 hour ago

But that $20/month is still heavily subsidised. You have to compare to the API costs, not the direct subscription.
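The PC-plus-electricity versus subscription comparison in this sub-thread can be made concrete with a rough break-even calculation. This is only a sketch: the $500 card price comes from the headline and the $20/month from the comment above, while the power draw, daily usage hours, and electricity price are placeholder assumptions, and the function name `break_even_months` is hypothetical. Swapping in API pricing instead of the subscription fee only changes the `monthly_fee` argument.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      watts: float,
                      hours_per_day: float,
                      price_per_kwh: float,
                      monthly_fee: float) -> Optional[float]:
    """Months until local hardware pays for itself versus a monthly fee.

    Returns None when electricity alone costs at least as much as the fee,
    in which case the hardware never breaks even.
    """
    # Energy used per 30-day month, in kWh, times the price per kWh.
    monthly_electricity = (watts / 1000.0) * hours_per_day * 30 * price_per_kwh
    monthly_saving = monthly_fee - monthly_electricity
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Assumed inputs: 180 W under load, 4 hours/day, $0.15 per kWh.
months = break_even_months(500, 180, 4, 0.15, 20)
print(months)
```

The result is very sensitive to the assumed usage hours and to what the hosted alternative really costs, which is the point both commenters are making from opposite directions.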

▲CJefferson 4 hours ago

They won't for coding and images, but they will socially. Everyone I know who has invested in home AI use is mostly using it for 'things that might get you banned/limited'.

▲Mashimo 4 hours ago

I'm quite impressed what is possible with just 12 to 16 GB of vram in terms of image generation.

▲emp17344 7 hours ago

Yet more evidence that the harness matters more than the model.

▲electroglyph 4 hours ago

what's with the weird "Geometric Lens routing" ?? sounds like a made up GPTism

▲tgiba 1 hour ago

Despite skepticism I love to see experiments like that. If we all are able to run an open source model locally on mid-high end machines I'd be very happy.

▲bdbdbdb 1 hour ago

This is the kind of innovation I love to see. The big AI companies' days are numbered if we can have the same quality in house.

▲Temporary_31337 1 hour ago

The headline is pretty stupid: it compares a model to a GPU that models run on. Somewhere in that data centre, some part of Sonnet inference runs on a $900 GPU, or maybe an even cheaper Google tensor chip.

▲riidom 10 hours ago

Not a word about the tok/sec, unfortunately.

▲arjie 8 hours ago

It won’t be meaningful considering the architecture: it’s a harness around the model that generates multiple solutions in multiple passes, using the tests to measure compliance and repair broken solutions. The resulting program won’t be streamed to you because it has already existed for minutes by the time it finishes the cycle. It’s more for an asynchronous use-case.

I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.

Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.
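The generate, test, and repair cycle described in this comment can be sketched in a few lines. This is not the repository's code: `fake_generate` and `fake_repair` are hypothetical stand-ins for calls to a local model, and the scoring is a plain pass count over input/output pairs. It only illustrates why the output arrives all at once after several passes instead of being streamed.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (arguments, expected result)


def run_tests(candidate_src: str, tests: List[TestCase]) -> int:
    """Count how many test cases the candidate's `solve` function passes."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # a real harness would sandbox this
        solve = namespace["solve"]
    except Exception:
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply does not count
    return passed


def best_of_n_with_repair(generate: Callable[[int], str],
                          repair: Callable[[str], str],
                          tests: List[TestCase],
                          n: int = 3,
                          max_repairs: int = 2) -> Tuple[str, int]:
    """Generate n candidates, try to repair failing ones, keep the best."""
    best_src, best_score = "", -1
    for i in range(n):
        src = generate(i)
        score = run_tests(src, tests)
        for _ in range(max_repairs):
            if score == len(tests):
                break
            repaired = repair(src)
            repaired_score = run_tests(repaired, tests)
            if repaired_score > score:  # keep a repair only if it helps
                src, score = repaired, repaired_score
        if score > best_score:
            best_src, best_score = src, score
        if best_score == len(tests):
            break  # stop early once a candidate passes everything
    return best_src, best_score


# Hypothetical stand-ins for model calls.
CANDIDATES = ["def solve(a, b):\n    return a - b\n",
              "def solve(a, b):\n    return a * b\n"]


def fake_generate(i: int) -> str:
    return CANDIDATES[i % len(CANDIDATES)]


def fake_repair(src: str) -> str:
    return src.replace("-", "+")


TESTS = [((1, 2), 3), ((2, 2), 4), ((0, 5), 5)]
src, score = best_of_n_with_repair(fake_generate, fake_repair, TESTS)
print(score, "of", len(TESTS), "tests passed")
```

Each candidate costs at least one full generation plus one test run, so wall-clock time grows with `n` and `max_repairs`, which fits the asynchronous use-case described above.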

▲Octoth0rpe 7 hours ago

> A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s)

There seems to be at least some detail on that point.

▲15minutemail 2 hours ago

74% on LCB from a single 5060 Ti. I've been paying Anthropic per task and this guy is running it on electricity money. 20 minutes per task is rough for anything interactive, though.

▲subroutine 2 hours ago

At 20 min per task you might as well code it yourself. Bill James needs to write a book on saber-metrics for LLM benchmarks.

▲0xbadcafebee 6 hours ago

This is specifically an experiment using ablation and multiple passes to improve the end result. Other techniques have been found that do this (like multiple passes through the same layers). But this technique, for this one specific model, seems to perform better while also taking much longer and requiring more complexity. It's unlikely most people would use this technique, but it's interesting.

▲sznio 1 hour ago

On that topic, anyone here got a decent local coding AI setup for a 12GB VRAM system? I have a Radeon 6700 XT and would like to run autocomplete on it. I can fit some models in the memory and they run quick but are just a tad too dumb. I have 64GB of system ram so I can run larger models and they are at least coherent, but really slow compared to running from VRAM.

▲negativegate 10 hours ago

Am I still SOL on AMD (9070 XT) when it comes to this stuff?

▲0xbadcafebee 6 hours ago

No? You can run any model that fits in its VRAM, and you can run larger models with layer/MoE offloading. Ask an AI what the best models you can run on that card are, then ask it for newer models than that. Ask what tuning options to pass to llama.cpp, and what the auto-tuning options are. Use ROCm builds.

It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.

▲patshead 8 hours ago

No, but yes? OmniCoder 9B at Q6 fits on my 9070 XT with 200k+ tokens of context, and it works pretty well with OpenCode. It is for sure the best local model that I've managed to squeeze onto my GPU, and it even works at 120k context at Q3 on an 8GB RX 580 GPU.

I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.

Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!

▲dangus 10 hours ago

Well, this specific solution was only set up on specific hardware, and is Nvidia dependent, as the README states.

That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.

Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.

▲dannyw 8 hours ago

Unfortunately AMD is much worse with supporting AI features like FSR4 on older hardware generations, despite the capability and leaked INT8 models being there. Totally unlike NVIDIA.

It’s absurd I have to use open source programs to get INT8 FSR4 support.

▲limoce 8 hours ago

The title should be "Adaptive Test-time Learning and Autonomous Specialization".

▲superkuh 9 hours ago

If anyone else was hoping this was using Q8 internally and that, converted to Q4, it could fit in 12GB VRAM: unfortunately it's already at Q4_K_M (~9GB), and the 16GB requirement comes from other parts, not from the 14B@8bit + KV cache/etc. you might guess.
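The arithmetic behind those figures can be checked with a back-of-the-envelope estimate of weight size from parameter count and bits per weight. The bits-per-weight values below are approximate assumptions for llama.cpp quantization formats (roughly 4.85 for Q4_K_M and 8.5 for Q8_0), the function name `weights_gb` is hypothetical, and the estimate covers weights only, not the KV cache or other runtime buffers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed average bits per weight for each quantization format.
q4_k_m = weights_gb(14, 4.85)  # close to the ~9GB quoted above
q8_0 = weights_gb(14, 8.5)     # what an 8-bit 14B model would need

print(f"14B at Q4_K_M: {q4_k_m:.1f} GB")
print(f"14B at Q8_0:   {q8_0:.1f} GB")
```

Under these assumptions an 8-bit 14B model would nearly fill a 16GB card on its own, which is why that guess is tempting, while the 4-bit weights leave several GB for everything else.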

▲Razengan 2 hours ago

Claude Code has been bleh or meh at best in my experience. There's so many posts on HN fawning about it lately that it could only be a guerrilla marketing campaign.

▲spiderfarmer 13 minutes ago

"I don't get it. Everyone else is wrong."

▲felixagentai 8 hours ago

[flagged]

▲dang 6 hours ago

We've banned this account. Please don't post automated comments to HN.

https://news.ycombinator.com/newsguidelines.html#generated
