Meanwhile, we're even seeing emerging 'engram' and 'inner-layer embedding parameter' techniques where the possibility of SSD offload is planned for in advance while the architecture is being designed.
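To gesture at why lookup-style parameters suit SSD offload, here's a minimal sketch; the sizes, filename, and memmap approach are illustrative assumptions, not any particular paper's design:

    # Illustrative only: lookup-style parameters (embedding tables,
    # engram-style memories) are SSD-friendly because each token touches
    # a few rows by index, so a memory-mapped file only pages in what's
    # actually used; dense matmul weights have no such access locality.
    import numpy as np

    VOCAB, DIM = 100_000, 512  # made-up sizes for the sketch

    # mode="w+" creates a scratch file so the sketch runs standalone; a
    # real system would open pretrained weights read-only with mode="r".
    table = np.memmap("embeddings.bin", dtype=np.float16, mode="w+",
                      shape=(VOCAB, DIM))

    def lookup(token_ids):
        """Fetch embedding rows; only these rows get paged in from disk."""
        return np.asarray(table[token_ids])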
For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.
I have a feeling it's nearing Opus 4.5 level, if they could fix it going crazy after ~100k tokens.
Mid-sized models like gpt-oss, MiniMax, and Qwen3.5 122B are around 6%, and Gemma4 31B is around 7% (but much slower).
I haven't tried Opus or ChatGPT due to high costs on OpenRouter for this application.
My use cases aren't code editing or authoring related, but when it comes to understanding a codebase and its docs to help stakeholders write tasks or understand systems, it has always outperformed American models at roughly half the price.
It's a fun way to quantify real-world performance differences between models that's more practical and actionable.
Overeager, but I was really really impressed.
So it has been convenient not to have hard stops / to allow for extra, but I still try to /clear at an actual 25% of the 1M anyhow.
This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.
Claude Opus at 150K context starts getting dumber and dumber.
Claude Opus at 200K+ is hopelessly incoherent. Abandon hope and start wrapping up the session.
If you want quality, you still have to compact or start new contexts often.
For around a month the limit seemed to be a little over 60k! I was despondent!!
What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure: they're trying to move from one context window to another, or have some KV cache issues, or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like another hint about KV caching, maybe it not porting well between differently shaped systems.
A more malicious read: this artificial limit also gives them a way to dial in system load. Simply not delivering the full context window the model supports reduces how much they have to host.
But to the question: yes, compaction is absolutely required. The AI can't even speak; it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could build this into the harness (see the sketch below), so no: it's a limitation of our tooling that it doesn't work around the stated context window being (effectively) a lie.
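A minimal sketch of what harness-side auto-compaction could look like; the 100k threshold, count_tokens, and summarize are stand-in assumptions, not OpenCode's actual internals:

    # Hypothetical harness-side auto-compaction. The 100k threshold
    # reflects where glm-5.1 currently breaks down for me, well short
    # of the stated 200k window.
    SAFE_LIMIT = 100_000

    def maybe_compact(messages, count_tokens, summarize):
        """Replace older turns with a summary once context nears the limit."""
        total = sum(count_tokens(m["content"]) for m in messages)
        if total < SAFE_LIMIT or len(messages) <= 11:
            return messages  # still safe (or too short to bother)

        # Keep the system prompt and recent turns verbatim; condense
        # the middle with one extra LLM call (summarize is a stand-in).
        head, middle, recent = messages[:1], messages[1:-10], messages[-10:]
        note = {"role": "user",
                "content": "Summary of earlier work:\n" + summarize(middle)}
        return head + [note] + recent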
I'd really like to see this improved! At least it's not 60-65k anymore; those were soul-crushing weeks, when I felt like my treasured, celebrated, joyful z.ai plan was near worthless.
There's a thread at https://news.ycombinator.com/item?id=47678279, and I have more extensive history / comments on what I've seen there.
The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue will be z.ai-specific, given what I've seen (200k working -> 60k -> 100k context windows on glm-5.1).
During off-peak hours, a simple 3-line CSS change took over 50 minutes, and it routinely times out mid-tool-call, leaving dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files.
But it's all casual side projects.
Edit: I often /compact at around 100,000 tokens or switch to a new session. Maybe that is why.
For the price this is a pretty damn impressive model.
Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1
$1.40 in / $4.40 out / $0.26 cached, per 1M tokens
That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
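For a rough sense of what those rates mean in practice, a back-of-the-envelope sketch; the 60M in / 10M out split is an assumed input-heavy agentic day, and caching is ignored:

    # Back-of-the-envelope cost at DeepInfra's listed GLM-5.1 rates.
    IN_RATE, OUT_RATE = 1.40, 4.40   # USD per 1M tokens
    tokens_in, tokens_out = 60, 10   # millions; assumed split

    cost = tokens_in * IN_RATE + tokens_out * OUT_RATE
    print(f"${cost:.2f}")  # 60*1.40 + 10*4.40 = $128.00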
I haven't had any bad luck with DeepInfra in particular, with quantization or rate limiting. But I've only heard bad things from people who used z.ai directly.
Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?
> "build a Linux-style desktop environment as a web application"
They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements. We all know that building a spec-compliant browser alone is a herculean task.
Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.
I find things like Claude's C compiler way more interesting: even though CCC is objectively bad (the code is messy, it generates very bad unoptimized output, etc.), it's at least something cool and shows that, with some human guidance, it could produce something even better.
Excited to test this.
For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.
https://github.com/Opencode-DCP/opencode-dynamic-context-pru...
Since the entire purpose, focus, and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue make it not an OK model? It's bad at the thing it's supposed to be good at, no?
It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.
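If you want to pin down that threshold instead of eyeballing it, a minimal sketch; it assumes the open-weight repo ships its tokenizer (the zai-org/GLM-5.1 id is from the DeepInfra link above):

    # Count transcript tokens each turn; note the count at the first
    # gibberish reply to locate the real breakdown point.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("zai-org/GLM-5.1",
                                        trust_remote_code=True)

    def transcript_tokens(messages):
        """Rough proxy for context fill: sum of per-message token counts."""
        return sum(len(tok.encode(m["content"])) for m in messages)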
It's a fine model
As Kimi did a huge amount of Claude distillation, it seems to be somewhat grounded in data.
https://www.anthropic.com/news/detecting-and-preventing-dist...
I'm curious how the bang-for-buck ratio compares. My initial tests on coding tasks have been positive, and I can run it at home. Bigger models, I assume, are still better on harder tasks.
So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.
I suspect that this isn't the model, but something z.ai is doing with hosting it. At launch I was elated to find glm-5.1 was stable even as the context window filled all the way up (~200k), whereas glm-5, while it could still talk and think, had forgotten the finer points of tool use to the point where it was making grievous errors as it went (burning gobs of tokens fixing duplicate-code problems).
However, really brutal changes happened sometime in the last two or three months: the problem the parent describes emerged, and emerged hard, out of nowhere. Worse, for me it seemed to hit around a 60k context window, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless, that I could only work on small problems.
Thankfully the coherency barrier rose significantly around three weeks ago. It now seems to lose its mind and emit chaotic non-sentence gibberish around 100k for me. GLM-5 was already getting pretty shaky at that point, so I feel like I at least have some kind of parity. But glm-5 at least kept speaking and thinking in real sentences, and I could keep conversing with it somewhat, whereas glm-5.1 seems to go from perfectly level-headed and working fine to total breakdown all of a sudden, a hard switch, at a predictable context window size.
It seems so, so probable to me that it isn't the model making this happen: it's the hosting. There's some KV cache issue, or they're trying to expand the context window in some way, or to switch from a small-context serving pool to a big-context one, or something infrastructure-wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope and misery.
I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...
All such a shame, because aside from totally going mad and speaking unpunctuated gibberish, glm-5.1 is clearly very, very good and I trust it enormously.
GLM5 also had this issue. When it was free on OpenRouter / Kilo the model was rock solid, though it did degrade gracefully after 100k tokens. Same at launch with z.ai, aside from regular timeouts.
Somewhere around early-to-mid March, z.ai did something significant to GLM5, like KV quanting or model quanting or both.
After that it's been Russian roulette. Sometimes it works flawlessly, but very often (1/4 or 1/5 of the time) thinking tokens spill into the main context, and if you don't spot it happening it can do real damage: heavily corrupting files, deleting whole directories. (A crude guard is sketched below.)
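A hypothetical harness-side check for that failure mode; the marker patterns are assumptions about what leaked reasoning markup might look like, not z.ai's actual delimiters:

    # Scan assistant output for reasoning delimiters before letting any
    # tool call execute; drop and retry the turn if markup leaked.
    import re

    LEAK_MARKERS = re.compile(r"</?think>|<\|thinking\|>", re.IGNORECASE)

    def is_leaking(assistant_text: str) -> bool:
        """True if reasoning markup leaked into the visible reply."""
        return bool(LEAK_MARKERS.search(assistant_text))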
You can see the pain by visiting the z.ai Discord, which is filled with reports of the issue, yet there's radio silence from z.ai.
Tellingly, despite the model being open source, not a single provider will sell you access at anything approaching the plans z.ai offers. The numbers just don't work, so your choice is either to pay significantly more per token and get reliability, or to put up with the bait and switch.
The bar is very low :(
But I used 70M tokens yesterday on glm-5.1 (thanks, GLM, for having good observability of token usage, unlike OpenAI; dunno about Anthropic) and got incredible, beautiful results that I super trust. It's done amazing work.
This limitation feels very shady and artificial to me, and I don't love it, but I also feel like I'm working somewhat effectively within the constraints. It does put a huge damper on people running more autonomous agentic systems, unless they have Pi or other systems that can more self-adaptively improve the harness.
[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]
Interesting.
Hopefully these aren't bots created by Z.AI, because GLM doesn't need fake engagement.
Thanks for watching out for the quality of HN...
There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.
[0]: https://www.reddit.com/r/DoneDirtCheap/comments/1n5gubz/get_...
[1]: https://www.reddit.com/r/AIJobs/comments/1oxjfjs/hiring_paid...
[2]: https://www.reddit.com/r/androiddev/comments/1sdyijs/no_code...