59 points by nab 3 hours ago | 13 comments
peterldowns 14 minutes ago
What kind of cpu/memory do the vms get? Is there a way to define the template that's used, so I can say to a new team member, log in to boxes.dev and all the repos and tools are already there for you? And where do you get the machines, can we bring our own? The orchestration layer and product experience ticks all the boxes for me but where Codex, Claude, and Cursor have fallen down for me in the past is:

- slow and outdated vms

- horrible/no way to standardize environments for my team

- no way to bring our own compute to help resolve these issues ^

cohix 1 hour ago
I really like the pricing model and focus on not shafting people by auto-sleeping when an agent is done working.

I’ve been working on an [OSS TUI](https://github.com/prettysmartdev/awman) for managing agent execution and workflows in containers (local or remotely) and would love to collaborate if you’re interested.

pickleglitch 30 minutes ago
You can pry localhost from my cold dead hands.
amirhirsch 35 minutes ago
This looks very clean, great job!

If your CTO didn't spend the past year making an orchestration tool and a baby is he even qualified?

I have a vibe-coded orchestrator that I use to manage my claude and codex sessions across multiple machines, can also spin up sprites from fly.

https://github.com/tinkerer/propanes

warning: it is probably totally unsuitable for anyone else to use except for me

The main idea is a widget that you embed in your apps that lets you select elements, paste screenshots, and prompt what to change. This workflow is very productive for me. I would encourage everyone to add element selection to their orchestrators' prompt composers. If you watch the looms on the readme, note that my CLAUDE.MD calls me a Meat Computer and reminds me to hydrate.

I have a native tauri version that lets you select UI elements through the macos accessibility api too.

The session service uses tmux so you can open a native terminal via ssh and tmux attach. I add a ton of features that are in varying degrees of half-baked: the "brainstorm" mode allows you to do microphone transcription while interacting with the DOM and it will suggest tickets automatically. I've also been working on "bd2sdd" which is supposed to take your strings of user inputs and transform them into a spec, presumably because I also desired regressions. There are Wiggums (which aren't relevant anymore with /goal) and "FAFO swarms" (fan-out, aggregate, filter, optimize) which I use to reverse engineer other pieces of software, and PowWow for codex and claude to work together.

I stole the structured views and remote session control from my friend's Agent Portal project txcl.io which is more fully-baked and narrower scope than propanes.

The ticketing system / tmux / structured views has been slowly evolving into multi-agent chat with a primary "Chief of Staff." It integrated pretty nicely into Slack.

iloveluce 1 hour ago
Interesting. Given that OpenAI and Anthropic are steadily moving down the stack (e.g. remote execution, Codex desktop, Claude Code integrations), how do you think about defensibility? Do you expect the labs to eventually offer a cloud-native ADE themselves, and if so, what advantage do you think an independent platform retains?

Also, do you see Boxes supporting OpenCode and self-hosted/local models in the future? If the rented machines have enough RAM and GPU access, it seems like there could be an interesting path toward a model-agnostic platform rather than being tied to the frontier labs.

nab 1 hour ago
A few angles to this. One is that coding just went through a massive change over the past year, that is not yet fully settled. Remember when everyone insisted on using IDEs and seeing the code with a chat sidebar? It's hard to argue you'll still be reading code a year from now. And even today, most people are still developing locally, which we're betting will shift to the cloud over the next few years.

I imagine other players will build cloud support in their own apps, but even now there's a lot of distraction for them. Everyone is trying to still support local execution, which looks really different from cloud. A lot of the labs are taking their coding-focused teams and throwing non-coding on their plates as well (the same app for non-engineers slinging google sheets).

We think getting the cloud experience right for software engineers (as well as companies, with their own hosting/development needs) is going to be really hard, and the problem needs a team fully focused on that. We also think that companies are rightly nervous about putting all their eggs in one basket -- their long term development environment should be harness and model agnostic.

RE OpenCode + self-hosted/local models: definitely. There's nothing holding us back from supporting these since we're just linux machines. But we wanted to start with the most popular harnesses first and go from there.

hasteg 7 minutes ago
Maybe I'm in the minority but I still program with an IDE and a chat window in the side at work, as well as when I work on side projects. I do like to actually see the code that is getting produced.
gazebo2 10 minutes ago
>It's hard to argue you'll still be reading code a year from now

groan

shivekkhurana 34 minutes ago
I have gotten into the habit of keeping the Codex app open on my laptop, and using the ChatGPT app on my phone as a remote. Maybe hosting is the way to go!
indigodaddy 1 hour ago
I might use this if it supported any old cloud or VPS, and was at most $10/mo. The fact that you have decided that this platform should only live in your own custom cloud is unappealing to me.

Or, open source it and let us run it on our own VPS and keep your expensive cloud for those who want to pay. As it stands would never consider it.

aliclark 24 minutes ago
I'm building something like this that you can run in your own cloud!

https://flexenv.com/

It's nowhere near as advanced as boxes.dev but it's built on the premise of running on any cloud. Indeed I have it running on two different bare metal server providers and I'm about to add a third (Azure) as I'm using my day job as my first customer.

Can I grab your contact details and schedule a demo?

nab 1 hour ago
Thanks a ton for the feedback. Yeah, this is something we'll try to solve in the long term. One of the things that makes this work really smoothly for setup and speed is the ability to have a template box that you can instantly snapshot and fork (disk and RAM) to spin up new machines. There aren't many sandbox providers that do that well for running a full app and development environment, but I'm sure there will be more over time. And the per-second pricing means that you only pay when your agent is running.

You could use VPS, but spinning up and down boxes on inactivity takes a long time, and making changes to the template for new machines is less trivial there. If you're only paying for 1 VPS box, then you lose the "multiple independent machines" benefit, and I imagine things start to get more expensive even in the VPS world when you have 10 of them running at the same time (one per thread).

indigodaddy 1 hour ago
Pretty sure you could accomplish this in a large physical server or even a huge resource VM (that has KVM passthrough) with some sort of microvm technology? Then that would obviate the need for "multiple cloud instance per coding thread", it would just be a microvm on the large server.

Then again, I'm just the guy running his mouth, and you guys are the ones actually doing the work :)

BTW, looks very polished and thought-through, I may have to still give it a try!

dregitsky 2 minutes ago
Nope you're exactly right - we're using microVMs today (Firecracker VMs via E2B) and running that same shape but on customer-owned machines is definitely one approach we're looking into.

And thank you!

soco 12 minutes ago
It feels somehow weird to see a cloud tool usable only from Macs. Oh well.
Arcuru 1 minute ago
It's even weirder that their long post doesn't mention it's Mac only.
__natty__ 1 hour ago
Maybe I’m naive but the longest single workflow I ran was maybe 15 minutes. How do you steer agents to run “overnight”? And what is the quality of such execution?
Bnjoroge 2 minutes ago
Works well for very well-defined tasks. If you have a really big feature like a front-end migration, you can use /plan and /goal, which I think are in most harnesses. You can also use other tools that allow your agent to interact with other terminals (I use an ADE called orca, which has an orca skill where an agent can spin up different sessions; these differ from subtasks because they share the context, and you can choose the harness/model, unlike sub-agents). The agent can also read from the terminal, use your browser or computer and take screenshots, and afterwards prepare a report or something.
dregitsky 12 minutes ago
To add to what @nab said, the longest ("overnight") runs are usually after going back and forth to build out a big multi-phase plan doc -- especially when each phase has an extensive manual test plan (agent runs the app in a browser, clicks through the workflow, watches logs, confirms behavior, etc).

These can go for many hours from all the manual testing and debugging. Quality really depends on how much you spec things out beforehand, and how you define the test plan / "success" gates. If the agent can't even run the app to test it then things can definitely go off the rails!

notrealyme123 1 hour ago
Usually coding where the closed loop evaluation takes time.

E.g. code debugging

nab 1 hour ago
This. Very few people are doing this right now (probably because it sucks having 5 copies of your app running in parallel on your laptop), but in the past few months models have gotten really good at testing your running app live. If you have an environment where you can run your full app and models can get at it via playwright and chromium, they can click around, take actions, and actually verify that their code works.

With boxes.dev I've started pushing agents harder to run the full app and test their work end to end, and send me screenshots as proof. This takes time, sometimes up to 30-40 minutes, but is much more likely to be bug free at the end of the day.

ai_slop_hater 44 minutes ago
I think they are just bullshitting.
FergusArgyll 48 minutes ago
In codex, if you use /goal it can go for a while. I've never seen overnight but > 1 hr is common
smrtinsert 16 minutes ago
"build me a 10 million dollar MRR saas, make no mistakes"
servercobra 1 hour ago
Nice, this looks exactly like what I've been looking for. I tried Fly.io Sprites and it _almost_ got me there, but I got annoyed logging into my CC every new feature. Unfortunately I wound up going all in on Cursor Cloud Agents, which overall has been decent.
Bnjoroge 1 hour ago
What are “box-hours”? Regular hours just running in boxes? Do I get charged the same when 1) the agent is doing some external thing, say a web search that takes a while, and 2) when the agent isn't running (say waiting for my input)?
dregitsky 1 hour ago
It's just one hour of runtime. But we put the machines to sleep very quickly once the agent finishes its work, and then wake when you interact in the UI (e.g. terminal, filesystem, send the agent a followup). We're running on firecracker microVMs so can sleep/wake very quickly, which keeps things nice and responsive.

Re: web searches -- we're running a full linux kernel and the agent runs on the machine itself, so we can't sleep mid run. Conceptually, moving the agent off-box and sleeping during web searches etc. would be interesting, but in our experience coding agents are running enough stuff on the machine itself (rg, bash, playwright, etc) that there wouldn't be much savings.
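
To make the box-hours arithmetic above concrete, here is a minimal sketch, assuming billing simply sums the seconds a box is awake and ignores time spent asleep. The `AwakeInterval` type and `box_hours` function are hypothetical names for illustration, not the boxes.dev API, and the real metering may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AwakeInterval:
    """One stretch during which a box was awake (hypothetical type)."""
    start_s: float  # seconds since an arbitrary reference point
    end_s: float


def box_hours(intervals: list[AwakeInterval]) -> float:
    """Sum awake time across intervals and convert seconds to hours."""
    awake_seconds = sum(max(0.0, i.end_s - i.start_s) for i in intervals)
    return awake_seconds / 3600.0


# Agent works for 30 min, the box sleeps for 90 min waiting on input,
# then a follow-up wakes it for another 30 min.
usage = [AwakeInterval(0, 1800), AwakeInterval(7200, 9000)]
print(box_hours(usage))  # 1.0
```

Under these assumptions the example bills one box-hour even though 2.5 hours of wall-clock time elapsed; a web search that happens mid-run would still count, since the box stays awake for it.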

ai_slop_hater 45 minutes ago
> ditch localhost; run Claude Code and Codex in the cloud

Why would I want this and not the other way around?

pavelpilyak 1 hour ago
How does this handle MCP credentials - both for stdio servers that read tokens from local config, and for HTTP ones where the harness holds an OAuth token? Either way, do those secrets end up in your cloud? Curious what the security model is
nab 1 hour ago
Right now the way you'd do this is you'd select the "Main box" or template VM in the UI, pull up a terminal tab, and authenticate whatever MCPs you care about. These are stored however the MCP is storing them (likely filesystem) on the VM. When you're done, you can "snapshot" the template VM and all future forks/new threads will start from that snapshot of filesystem + RAM.

We recommend you auth with only development credentials (or use something like 2 factor confirmation if you have more sensitive things you want to confirm before the agent accesses), but it's still early for us and we're continuing to refine this as we go. For companies, we're down to brainstorm how they'd like this to ideally work for them. And over the long term we'll support hosting this in your own cloud.

Curious if you have a take on how you'd like this to work from a UX standpoint.
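
A toy model of the snapshot-and-fork flow described above may help show where credentials end up. This is a sketch under loud assumptions: `Box`, `TemplateBox`, and the `mcp_token` key are hypothetical, and a deep copy stands in for whatever the real disk + RAM snapshot mechanism does.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Box:
    """Stand-in for a VM: a filesystem and some in-memory state."""
    files: dict = field(default_factory=dict)
    ram: dict = field(default_factory=dict)


class TemplateBox:
    """Hypothetical 'Main box' that new threads are forked from."""

    def __init__(self) -> None:
        self.main = Box()
        self._snapshot = None

    def snapshot(self) -> None:
        # Freeze the current filesystem + RAM of the main box.
        self._snapshot = copy.deepcopy(self.main)

    def fork(self) -> Box:
        # Every new thread starts from an independent copy of the snapshot.
        if self._snapshot is None:
            raise RuntimeError("take a snapshot before forking")
        return copy.deepcopy(self._snapshot)


template = TemplateBox()
template.main.files["mcp_token"] = "dev-only-token"  # auth done on the template
template.snapshot()

thread_a = template.fork()
thread_b = template.fork()
thread_a.files["scratch.txt"] = "work in progress"

print(thread_b.files)  # {'mcp_token': 'dev-only-token'}
```

The point of the sketch is that anything stored on the template at snapshot time is copied into every fork, which is why development-only credentials are the safer choice there, while changes made inside one fork stay in that fork.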
