> However, this result made it clear that the reliability of state-of-the-art LLMs is fundamentally limited: If they need to complete every step correctly in order to solve a task, after a certain number of steps they will almost surely fail as a result of an underlying propensity to make errors, even when the answer should be obvious. While an error rate of 1-in-1,000 seems low, and would be great on a traditional LLM benchmark, on a task that requires successful execution of thousands of steps in a row, such a system results in inevitable failure.
What a relief to see an obvious problem actually acknowledged. I can't even guess how many times I've been shouted down about this exact topic in the reasoning debates on HN, or seen papers just kind of glossing over it as if it were a non-issue.
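To make the quote's compounding concrete, here's a quick sketch. The 1-in-1,000 rate is the one from the quote; the step counts are just illustrative:

```python
# If every step must succeed, and each fails independently with
# probability 1/1000, the chance of finishing an N-step task is 0.999^N.
p_step = 0.999

for n in (10, 1_000, 5_000):
    print(f"{n:>5} steps -> {p_step ** n:.1%} chance of success")
```

A model that looks near-perfect on short benchmarks still fails almost surely once tasks reach a few thousand sequential steps: 0.999^1000 is already down around 37%.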
The next really natural question is: if you're committed to decomposing a problem into tons of microsteps and voting, why aren't we just embracing hybrid symbolic systems? The decomposition step kind of implies you're in a problem domain where variables separate out somewhat cleanly, so this should be doable. As far as I can tell, the "voting" discussed in the paper is about candidate outputs, i.e. solutions to subproblems? If you switch to hybrid symbolic systems, then you can vote on candidate inputs to solvers and at least be damned sure that their output is always correct.
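A minimal sketch of that "vote on inputs, not outputs" idea. Note `sample_formalization` and `exact_solver` are hypothetical stand-ins, not anything from the paper:

```python
from collections import Counter

def solve_with_input_voting(problem, sample_formalization, exact_solver, k=5):
    # Sample k candidate formalizations of the problem (e.g. from an LLM),
    # majority-vote on the formalization itself, then hand the winner to
    # an exact solver whose output is correct by construction.
    candidates = [sample_formalization(problem) for _ in range(k)]
    winner, _ = Counter(candidates).most_common(1)[0]
    return exact_solver(winner)
```

The point is that the only fuzzy step left is the translation into the solver's input language; everything downstream of the vote is deterministic.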
Also, the success of chain-of-code compared with chain-of-thought approaches could actually imply that having no real solver is not the obstacle you'd expect! Maybe you can invent a semiformal logic just in time that's expressive enough to encapsulate the problem domain, and have the LLM emulate a nonexistent solver. If the error rate with this sort of approach is still too high, then at least you know concretely what solver or formal language you need to implement in order to improve.
Briefly, the idea is to recursively decompose tasks into the simplest possible steps, recursively call (relatively small) LLMs as agents to execute one step at a time, and use a clever voting scheme to choose how to execute each step. The authors use this technique to get a relatively small LLM to solve Towers of Hanoi with 20 rings (1M steps). All of it using natural language.
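My rough reading of the recipe, as a toy sketch (the paper's actual prompts and voting details are more involved; `decompose`, `execute_step`, and `is_atomic` are placeholders you'd supply):

```python
from collections import Counter

def run(task, decompose, execute_step, is_atomic, votes=5):
    # Recurse until a (sub)task is trivial; at the leaves, let several
    # microagent calls attempt the step and keep the majority answer.
    if is_atomic(task):
        attempts = [execute_step(task) for _ in range(votes)]
        return Counter(attempts).most_common(1)[0][0]
    return [run(sub, decompose, execute_step, is_atomic, votes)
            for sub in decompose(task)]
```

The error correction lives entirely at the leaves: each tiny step gets voted on independently, so one flaky sample can't derail the whole run.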
The most obvious question is whether other tasks, more interesting -- less "rote" -- than Towers of Hanoi, can similarly be recursively decomposed into simple steps. I'm not sure that's always possible.
> into the simplest possible steps, recursively call (relatively small) LLMs as agents to execute one step at a time, and use a clever voting scheme to choose how to execute each step.
One issue I often run into with this stuff is the tightly coupled nature of things in the real world. I’ll fashion an example:
Let’s say you break a job down into 3 tasks: A, B and C. Doing one of those tasks is too much for an LLM to accomplish in one turn (this is something you learn intuitively through experience), but an LLM could break each task into 3 subtasks. So you do that, and start by having the LLM break task A into subtasks A1, A2 and A3. And B into B1, B2 and B3. But when you break down task C, the LLM (which needs to start with a fresh context each time since each “breakdown” uses 60-70% of the context) doesn’t know the details of task A, and thus writes a prompt for C1 that is incompatible with “the world where A1 has been completed”.
This sort of “tunnel vision” is currently an issue with scaling 2025 agents. As useful context lengths get longer it’ll get easier, but figuring out how to pack exactly the right info into a context is tough, especially when the tool you’d reach for to automate it (LLMs) is the same tool that suffers from these context limitations.
None of this means big things aren’t possible, just that the fussiness of these systems increases with the size of the task, and that fussiness leads to more requirements for “human review” in the process.
I've been experimenting with this with a custom /plan slash command for claude code, available here: https://github.com/atomCAD/agents
Planning is definitely still something that requires a human in the loop, but I have been able to avoid the problem you are describing. It does require some trickery (not yet represented in the /plan command) when the overall plan exceeds a reasonable context window size (~20k tokens). You basically have to start having the AI compare combinatorially many batches of the plan against each other, to discover and correct these dependency issues.
>the LLM (which needs to start with a fresh context each time since each “breakdown” uses 60-70% of the context) doesn’t know the details of task A, and thus writes a prompt for C1 that is incompatible with “the world where A1 has been completed”.
Can't that be solved with sub-agents? The main agent oversees and combines the code, and calls sub-agents for each task.
Reasoning by analogy is great for intuition, but doesn’t guarantee real results hold. Consider “voltage is like water pressure in pipes, so if there’s a cut in my wire’s insulation, the device won’t get enough voltage” — clearly this is not true, even though it relies on an analogy that’s generally useful.
I really like that analogy, thank you for it. Also applies to “it’s overvoltage, so I just need to poke a little hole in it to let the excess bleed out”…
> "If there’s a cut in my wire’s insulation, the device won’t get enough voltage" doesn't follow from: "voltage is like water pressure in pipes"
I absolutely agree! In the same way, "an LLM can solve complex problems if it breaks them into subtasks" doesn't follow from "NASA breaks large projects into smaller parts"
IBM tried that with CMM (Capability Maturity Model), and it didn't work. The problem is that NASA knows what they're building: rockets and satellites don't have any grey areas, and everything is specified. Other things are less well defined, and the people doing the specifying aren't rocket scientists.
This is a really good analogy, because the complex intersections between multiple groups working independently while trying to collaborate in a hierarchy toward one large goal were among the things that hid the problems leading to the Challenger disaster, according to Feynman.
The space shuttle’s design was also deeply flawed, to the point that it failed at its core objective: significantly lowering costs. Instead, the core mission was sacrificed to meet arbitrary design goals, such as being able to de-orbit heavy objects.
That’s the core issue with decomposing tasks: you aren’t communicating back up the chain and finding globally optimal solutions unless the task is simple enough to be completely understood.
> The approach relies on an extreme decomposition of a task into subtasks, each of which can be tackled by focused microagents. The high level of modularity resulting from the decomposition allows error correction to be applied at each step through an efficient multi-agent voting scheme.
Big if that the decomposition and the voting happen accurately for anything other than toy problems
The approach in the paper specifically addresses the case where an LLM can usually solve a task when it requires few steps, but fails for the same kind of task with more steps because it randomly gets a step in the middle wrong and then derails. It can't do anything for tasks that the LLM can't solve even when there's just a few steps.
In other words, it compensates for random error, not systematic error.
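A quick way to see that, with made-up per-step rates: majority voting over independent attempts crushes random error, but if the model is consistently wrong, voting just locks the wrong answer in.

```python
from math import comb

def majority_error(eps, k):
    # Probability that a strict majority of k independent attempts,
    # each wrong with probability eps, comes out wrong (worst case:
    # wrong attempts all agree).
    return sum(comb(k, j) * eps**j * (1 - eps)**(k - j)
               for j in range(k // 2 + 1, k + 1))

print(majority_error(0.01, 5))  # random error: ~1e-5, far below 1%
print(majority_error(0.90, 5))  # systematic error: worse than a single attempt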
I was just thinking "these guys will talk about this graph for the rest of their lives", it's the best graph you could ever hope to put into a paper. Loved it.
In case you want to know what’s going on in the left side of that chart, they gave a log-scale version in Appendix A. I was thinking it was silly not to just use that version up top, but I guess log scales make big differences ’feel’ smaller.
A log scale is actually appropriate in this context from a first-principles perspective. Per scaling laws (and also general behavior of epsilon-probability of failure multiplied N times), you would generally expect more vs. less effective techniques to have multiplicatively greater or fewer steps until failure, not additively greater/fewer. Figure 1 is comical, but the appendix figure is the more scientifically appropriate one.
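To see why multiplicative is the right mental model (rates illustrative): with an independent per-step failure probability eps, steps-until-first-failure is geometric, so expected task length scales like 1/eps, and a 10x better technique shows up as a constant shift on a log axis.

```python
for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    # Geometric distribution: the mean number of steps before the first
    # failure is 1/eps, so each 10x drop in eps buys 10x more steps.
    print(f"eps = {eps:g} -> ~{1 / eps:,.0f} steps before failure")
```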
Especially since it's a recursive problem, so each step is naturally broken up into subtasks. And the algorithm for which subtasks to break it into is public. This makes it much easier to get down to a case that the LLM can reliably solve.
I guess that the subtask decomposition of many (sub)problems is known and in the training distribution. How many real-world problems are resistant to divide-and-conquer? Presumably most/all of the unsolved mathematics conjectures. What else?
This has seemed to me to be the natural next step to turn LLMs into more deterministic tools. Pushing the frontier is nice, but I think LLMs have a whole different gear when they are able to self-decompose in a reliable way. Most of my success creating reusable LLM products came from determining where requirements/outputs need to be "hard" vs. "soft".
The problem is how to even define a task using the English language and make sure there is enough entropy to infer the detailed intent, so that it can later be split into zillions of small steps executed over time by an LLM.
One issue I see is when steps in a plan depend on one another: when you can't know all the next steps exactly before seeing the results of the previous ones, and when you may sometimes have to backtrack.
state = init_state()
while not is_complete(state):  # `state is not complete` would test identity, not completion
    state = LLM("You are a helpful assistant. The rules and format of the game is [...]. The correct strategy to use at each step is [...]. The current state is [...]. Output the state after making the next move")
No mention of MoE. One would think this is a logical evolution of that, but there's no mention (that I saw). Its own rubric for the task, Towers of Hanoi, was admittedly weak.
LLM papers are starting to look like the last decade of JS frameworks and tools, only with less code and more academics. And that's disappointing, because I think a lack of pragmatism and grounding is now holding the field back...