I was halfway through writing a plan for a migration last Tuesday when I realized I had been sitting in plan mode for forty minutes, unable to use my terminal for anything else.
That is the thing about Claude Code's plan mode. It is useful. I have written about how planning is the underrated half of agentic engineering in my piece on harness engineering. But plan mode ties up your local session. You sit there. You wait. You scroll. You refine. And meanwhile, the terminal that you need for actual work is occupied by an LLM thinking out loud.
UltraPlan is Anthropic's answer to that problem, and it goes further than I expected. It is not just "plan mode but in the cloud." It is a fundamentally different architecture for how an AI coding agent thinks through a problem before writing code. Three agents attempt the problem in parallel. A fourth agent criticizes and synthesizes. You review the result in a browser. Then you choose whether to execute remotely or pull the plan back to your local machine.
I have been using it since it landed in the research preview at the start of week fifteen. Here is what it actually does, what it gets right, and what is still missing.
## Table of Contents
- What UltraPlan actually is
- The multi-agent architecture: explorers and the critic
- How the planning workflow changes in practice
- The browser review UI and why it matters more than you think
- Security architecture: sandboxed VMs and credential proxying
- Execution options: cloud PR versus teleport to terminal
- Limitations and what is missing
- What this means for how we build with agents
- FAQ
## What UltraPlan actually is
UltraPlan is a research-preview planning workflow available in Claude Code v2.1.91 and later. The basic idea is that it moves the planning step from your local CLI to a "Claude Code on the web" session. That is the one-sentence summary.
But the single sentence hides the interesting part, which is the architecture underneath.
When you invoke UltraPlan, your planning task gets shipped to an Anthropic-managed virtual machine in the cloud. That VM runs three independent "explorer" agents in parallel, each one attempting to produce a plan for your task from a different angle. Once the explorers finish, a "critic" agent reads all three plans and synthesizes the best elements into a single coherent output. You get a browser-based review UI where you can read the final plan, leave inline comments, react to specific sections, and navigate via an outline sidebar.
You invoke it three ways. You can type /ultraplan as a slash command. You can include the word "ultraplan" somewhere in your prompt. Or you can take an existing local plan and push it to the cloud for refinement. The plan moves through three states: drafting, needs input, and ready. The drafting state is where the explorers and critic are doing their work. The needs input state is where the plan is waiting for you to clarify something or make a decision. The ready state means the plan is complete and waiting for your approval.
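The lifecycle is easiest to see as a small state machine. The state names come from the feature itself, but the transition rules below are my own reading of the workflow, not a published API:

```python
from enum import Enum

class PlanState(Enum):
    DRAFTING = "drafting"        # explorers and critic are working
    NEEDS_INPUT = "needs_input"  # paused, waiting on the user to clarify
    READY = "ready"              # plan complete, awaiting approval

# My inferred transitions: drafting can pause for input, input resumes
# drafting, and drafting eventually finishes as ready.
TRANSITIONS = {
    PlanState.DRAFTING: {PlanState.NEEDS_INPUT, PlanState.READY},
    PlanState.NEEDS_INPUT: {PlanState.DRAFTING},
    PlanState.READY: set(),
}

def can_transition(current: PlanState, nxt: PlanState) -> bool:
    return nxt in TRANSITIONS[current]
```

The useful observation is that `READY` is terminal from the planner's side: once a plan is ready, the next move (approve, execute, teleport) is yours, not the agents'.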
It requires a Claude Code on the web subscription. If you are accessing Claude through Bedrock, Vertex, or Foundry, this is not available to you right now. That is worth stating upfront because a lot of enterprise teams reading this will immediately wonder about their existing API agreements. UltraPlan is a consumer-side feature for now.
The context window goes up to one million tokens depending on your plan, which is significant. When you are planning a migration that touches two hundred files across six services, context depth matters. I will come back to that.
## The multi-agent architecture: explorers and the critic
This is the part that caught my attention, because I have spent a lot of time thinking about multi-agent patterns. I wrote about them in detail in building production-ready multi-agent systems, and the UltraPlan architecture maps cleanly to a pattern I have seen work well in practice.
Three explorer agents attempt the same problem independently. They do not share context with each other during their runs. Each one produces its own plan. Then the critic agent reads all three and produces the final synthesis.
Why three? And why independent?
The answer, I think, is about coverage of the solution space. If you run one agent through a complex planning problem, it commits to a particular approach early and follows that thread to its conclusion. The approach might be good. It might be the best approach. But you will never know what you missed. Three independent runs with different initial trajectories are more likely to surface different considerations, different risks, different orderings of the work.
This is essentially the same insight behind ensemble methods in machine learning. A random forest outperforms a single decision tree not because any individual tree is better, but because the aggregation smooths out the biases of each one. The critic agent is performing that aggregation step. It reads three plans and picks the best structural decisions from each.
I have built similar patterns myself in production. The version I described in the multi-agent piece used a sequential approach with a researcher, analyst, writer, and reviewer. UltraPlan's pattern is different in an important way. The explorers run in parallel, not sequentially. The critic is not iterating with them. It reads their outputs cold and makes a synthesis decision. That is a deliberate architectural choice. Sequential iteration between agents gives you deeper refinement but is slow. Parallel exploration with post-hoc synthesis gives you breadth and speed at the cost of depth.
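The shape of the pattern, not UltraPlan's actual implementation, which is not public, can be sketched in a few lines. Every name here is mine, and the explorer and critic bodies are stand-ins for what would really be independent LLM runs:

```python
from concurrent.futures import ThreadPoolExecutor

def explore(task: str, seed: int) -> dict:
    # Stand-in for one explorer agent. In the real system this is an
    # independent model run with its own trajectory and no shared context.
    return {"approach": f"approach-{seed}", "steps": [f"step for {task}"]}

def critique(plans: list[dict]) -> dict:
    # Stand-in for the critic. It reads the finished outputs cold and
    # synthesizes; it never iterates with the explorers.
    return {"synthesis": [p["approach"] for p in plans]}

def ultraplan(task: str) -> dict:
    # Breadth over depth: three explorers run in parallel rather than
    # one agent refining sequentially.
    with ThreadPoolExecutor(max_workers=3) as pool:
        plans = list(pool.map(lambda s: explore(task, s), range(3)))
    return critique(plans)
```

The structural point the sketch makes is that `critique` takes completed plans as input. Swap the parallel map for a loop that feeds each agent's output to the next and you get the sequential, deeper-but-slower variant I described in the multi-agent piece.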
For planning specifically, I think breadth is more valuable than depth. A plan that considers three different approaches and picks the best elements from each is more useful than a plan that goes deeper and deeper into one approach. You want the plan to surface the things you did not think of. Depth can happen during execution.
The limitation is that the critic agent has a hard job. It needs to compare three plans that may use different terminology, different decompositions, different assumptions about the problem. It needs to figure out which structural decisions from Plan A combine well with the sequencing decisions from Plan B and the risk identification from Plan C. I do not know how well this works at the boundary conditions, because Anthropic has not published benchmarks for the workflow itself. That is an honest gap. I am going on my own experience over the last week, which is a small sample.
What I can say is that the plans I have gotten from UltraPlan are noticeably more thorough than what I get from local plan mode. They catch more edge cases. They identify more dependencies. Whether that is because of the multi-agent architecture or because of the larger context window or because of some difference in the cloud model configuration, I genuinely cannot say. It is probably all three.
## How the planning workflow changes in practice
The obvious change is that your terminal is free. That sounds small until you experience it.
Before UltraPlan, my workflow for a complex task looked like this: enter plan mode, write the initial prompt, wait for the plan to generate, read the plan, refine it two or three times, approve it, and then execution begins. During that entire planning phase, my terminal is locked. I cannot run tests. I cannot check git status. I cannot open another Claude Code session in the same directory without risking file conflicts. I am just sitting there reading plan output and typing refinements.
With UltraPlan, I kick off the planning task and go do something else. Check Slack. Run the test suite for the thing I finished yesterday. Review a PR. The plan is being generated in the cloud, and it will be there when I am ready to review it.
This is the same non-blocking pattern that has made CI/CD pipelines so much more productive than local builds. The insight is not new. Offload the heavy computation so the developer can keep working. But applying it to the planning step of an agentic workflow is new, and it changes how you think about the cost of planning.
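The difference between local plan mode and UltraPlan is essentially the difference between a blocking call and a future. A toy sketch, with a `sleep` standing in for minutes of cloud-side planning:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def generate_plan(task: str) -> str:
    time.sleep(0.1)  # stand-in for the minutes the explorers and critic take
    return f"plan for: {task}"

pool = ThreadPoolExecutor()

# Kick off planning and get a handle back immediately. The calling
# thread -- your terminal, in the analogy -- stays free.
future = pool.submit(generate_plan, "migrate the payment service")

# ...review a PR, run yesterday's test suite, check Slack...

# Block only at the moment you actually want to read the plan.
plan = future.result()
```

Local plan mode is the version of this where `generate_plan` is called directly and you sit inside it for forty minutes.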
Here is what I mean by that. Before UltraPlan, there was a real time cost to planning carefully. If you spent thirty minutes in plan mode refining a plan, that was thirty minutes of blocked terminal time. So there was an incentive to plan less thoroughly and let the agent figure things out during execution. That incentive was bad. Poor planning leads to wasted execution cycles, which leads to wasted tokens, which leads to wasted money. But the incentive existed because developers are human and waiting is boring.
UltraPlan removes that incentive. Planning can take as long as it needs to take because you are not paying for it with blocked time. That is a subtle shift, but I think it is the most important practical change.
The plan's three states (drafting, needs input, and ready) give you a lightweight state machine to track progress. The needs input state is interesting because it means the planning agents can ask you questions. If the explorers hit an ambiguity in your task description, the plan pauses and asks for clarification rather than guessing. That is better behavior than what local plan mode does, which is usually to make an assumption and continue. Assumptions made by agents during planning often turn into bugs during execution.
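The pause-and-ask behavior is a pattern worth naming on its own. A minimal sketch, with every name my own invention rather than anything from UltraPlan's internals:

```python
class NeedsInput(Exception):
    """Raised when the planner hits an ambiguity it refuses to guess at."""
    def __init__(self, question: str):
        super().__init__(question)
        self.question = question

def plan_step(task: dict) -> str:
    # Surface the ambiguity instead of assuming: planning-time
    # assumptions tend to become execution-time bugs.
    if "target_db" not in task:
        raise NeedsInput("Which database should the migration target?")
    return f"migrate to {task['target_db']}"

def run(task: dict, ask_user) -> str:
    while True:
        try:
            return plan_step(task)
        except NeedsInput as pause:
            # The plan sits in 'needs input' until the human responds.
            key, value = ask_user(pause.question)
            task[key] = value
```

The cost, which I come back to in the limitations section, is that the human becomes a synchronization point: the loop cannot make progress until `ask_user` returns.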
## The browser review UI and why it matters more than you think
The review UI is a browser-based interface where you read and interact with the final plan. It has inline comments, section-scoped reactions, and an outline navigation panel.
The inline comments are the important part. In local plan mode, if you want to modify a plan, you type your feedback into the terminal as a single message. You say something like "the migration order should be reversed, start with the payment service first." The agent reads that and regenerates the plan. But you lose the spatial relationship between your feedback and the part of the plan it applies to.
In the browser UI, you click on the specific section of the plan that needs changing and leave a comment right there. The agent can see exactly which paragraph, which step, which decision your feedback is attached to. That is not a cosmetic improvement. It is a precision improvement. Anyone who has done code review knows the difference between "there's a bug somewhere in this file" and a comment on the exact line. UltraPlan's review UI brings that same precision to plan review.
The section-scoped reactions are lighter-weight. You can signal that a particular section looks good, or that you have concerns, without writing a full comment. The outline navigation is straightforward. For a long plan with twenty or thirty steps, you need a way to jump around without scrolling. Nothing revolutionary, but necessary.
There is something else the browser UI enables that I did not expect to matter as much as it does. It creates a shareable artifact. A plan in your local terminal is ephemeral. It is in your scrollback buffer and nowhere else. A plan in the browser UI has a URL. You can send it to a colleague. You can reference it in a ticket. You can come back to it three days later when you resume the task. That persistence changes the plan from a throwaway artifact into a document, and documents have a longer useful life than scrollback.
I realize I am spending a lot of words on a review UI. But the review step is where plans succeed or fail. A perfect planning algorithm that produces output nobody carefully reviews is worse than a mediocre planning algorithm with great review tooling. The bottleneck in planning has always been the human review step, not the generation step. UltraPlan puts real effort into the bottleneck.
## Security architecture: sandboxed VMs and credential proxying
When Anthropic announced that UltraPlan runs your code in an Anthropic-managed VM, the first question from every security-conscious developer was obvious. What happens to my credentials?
Here is the model. The VM is isolated. Your GitHub authentication is handled through a secure proxy. Your credentials never enter the sandbox itself. The proxy authenticates on your behalf but does not pass your tokens or SSH keys into the environment where the agent code runs. Outbound traffic from the VM goes through an allowlist proxy with audit logging. The VM cannot reach arbitrary endpoints on the internet. It can reach GitHub and whatever other services are on the allowlist, and that is it.
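The egress side of this model is a default-deny allowlist with an audit trail. A minimal sketch of the idea (the hostnames and logging shape are illustrative assumptions, not Anthropic's actual configuration):

```python
from urllib.parse import urlparse

# Illustrative only: the real allowlist is managed by Anthropic.
ALLOWLIST = {"github.com", "api.github.com"}

audit_log: list[tuple[str, bool]] = []

def egress_allowed(url: str) -> bool:
    # Default-deny: a host is reachable only if it is explicitly
    # allowlisted, and every attempt is recorded for auditing,
    # whether it was allowed or blocked.
    host = urlparse(url).hostname or ""
    allowed = host in ALLOWLIST
    audit_log.append((url, allowed))
    return allowed
```

The detail that matters is that blocked attempts are logged too. An audit trail that only records successful requests tells you nothing about what the agent tried to reach.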
This is a reasonable architecture. It follows the principle of least privilege. The agent can read your repository and create pull requests, but it cannot exfiltrate your credentials or reach services it does not need. The audit logging means you have a record of every outbound request the VM made during your planning session.
I want to be precise about what this means and what it does not mean. It means that Anthropic has built a defensible security posture for the current threat model, which is: prevent the agent from leaking credentials or reaching unintended services. It does not mean the system is immune to all possible attacks. The agent still has access to your repository contents. If your repository contains secrets that should not be in the repository, that is your problem, not UltraPlan's. If the allowlist proxy has a misconfiguration that allows unintended egress, that is Anthropic's problem. No security architecture is perfect. This one is reasonable.
For enterprise teams evaluating this, the question is whether your security posture allows code to be processed on Anthropic-managed infrastructure at all. If you are in a regulated industry where data cannot leave your environment, UltraPlan is not an option regardless of how good its sandboxing is. This is a business decision, not a technical one. The technical architecture is sound. The business question is about data residency and compliance, and that varies by organization.
The audit logging is worth calling out specifically. If something goes wrong, you have a trace of what the VM did. That is not just a security feature. It is a debugging feature. When a plan makes a recommendation that seems off, being able to trace what the explorers actually looked at during their runs would be valuable. I have not seen whether the audit logs are exposed at that level of detail, but the infrastructure for it exists.
## Execution options: cloud PR versus teleport to terminal
Once you approve a plan, you have two options for execution.
The first is to let it run remotely. The cloud VM executes the plan and creates a pull request when it is done. You review the PR the way you would review any other PR. This is the fully non-blocking option. You approve the plan, close the browser tab, and come back when the PR notification hits your inbox.
The second is to "teleport" the plan back to your local terminal. This pulls the approved plan into your local Claude Code session and begins execution there. You are back in the local flow, watching the agent work in your terminal, able to intervene if something goes sideways.
Both options have their place. The remote execution path is better for well-defined tasks where you trust the plan and want maximum parallelism. Kick off three UltraPlan sessions for three independent tasks, approve all three plans, and let them run in the cloud while you do something else entirely. That is a genuine productivity multiplier.
The teleport path is better for tasks where you want to stay hands-on. Maybe the task touches a particularly sensitive part of the codebase. Maybe you want to run tests locally as the agent makes changes. Maybe you just want to watch and learn from how the agent approaches the problem. The teleport gives you the benefit of the cloud-based planning without giving up local control of execution.
I find myself using remote execution for tasks I understand well and teleport for tasks where I am less certain about the outcome. The heuristic is roughly: if I would trust a junior developer to execute this plan without supervision, I use remote execution. If I would want to pair with them, I use teleport.
The remote execution path creates a PR, which means it interacts with your repository in a way that is visible and reviewable. That is a good default. A PR can be reviewed, discussed, and reverted. An agent that silently commits to main is an agent that creates irreversible damage. The PR-based workflow is the same pattern I described in the piece about why the moat was never the orchestrator. The execution layer is less important than the review and approval layer. UltraPlan gets this right.
One thing I want to call out. The remote execution happens on the same Anthropic-managed VM with the same security constraints I described earlier. It is not running on your local machine, your CI, or your cloud account. It is running in Anthropic's sandbox. The PR it creates is the artifact that crosses the boundary from their infrastructure back into yours. That is a clean interface.
## Limitations and what is missing
UltraPlan is a research preview. That label is doing real work here. There are things that are missing or rough around the edges, and they are worth naming.
First, there are no published benchmarks. Anthropic has not released numbers on how UltraPlan plans compare to local plan mode in terms of plan quality, execution success rate, or any other metric. The multi-agent explorer-plus-critic architecture is intuitively appealing, and my anecdotal experience has been positive. But I cannot point to a controlled study that says "UltraPlan plans result in 30% fewer execution failures" or anything like that. This is a gap. I suspect Anthropic is collecting this data internally, but until it is public, you are running on vibes and personal experience. I am being transparent about that.
Second, Bun has issues with the network proxy. If your project uses Bun as its package manager, you may hit problems. The allowlist proxy that constrains outbound traffic apparently does not play well with how Bun handles network requests. This is a specific and annoying limitation if your stack depends on Bun. I do not know whether it is a fundamental incompatibility or just something that has not been fixed yet in the research preview.
Third, the subscription requirement limits who can use it. UltraPlan requires a Claude Code on the web account. If your organization accesses Claude through Amazon Bedrock, Google Vertex, or Azure Foundry, you cannot use UltraPlan. For individual developers and small teams, this is fine. For enterprise teams that have standardized on one of those cloud providers for their AI access, it is a blocker. Anthropic will presumably expand access over time, but right now, the addressable market for UltraPlan is smaller than the addressable market for Claude Code itself.
Fourth, the needs input state, while better than silent assumptions, introduces a synchronization point. If the explorers hit an ambiguity at 2 AM and you are asleep, the plan sits in needs input until you wake up and respond. For a feature that is supposed to be non-blocking, this reintroduces a form of blocking. It is the right tradeoff (asking is better than guessing), but it means UltraPlan is not truly fire-and-forget. You still need to be available for questions.
Fifth, I am not sure how well the critic agent handles conflicting plans. When the three explorers produce broadly similar plans with minor variations, the synthesis is straightforward. When they produce fundamentally different approaches, the critic has to make a judgment call about which approach is better. I have seen cases where the critic seems to default to a safe middle ground rather than making a bold choice, averaging the plans rather than picking the best one. Averaging is not always the right move. Sometimes Explorer B had the right idea and the others were wrong, and the correct synthesis is to just go with Plan B. I do not know if the critic can do that. My sample size is too small to be confident.
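The distinction I am gesturing at is a policy choice inside the synthesis step. In sketch terms (none of this is UltraPlan's actual code; the scoring function is a stand-in for whatever judgment the critic applies):

```python
def score(plan: dict) -> float:
    # Stand-in: a real critic would weigh coverage, sequencing,
    # and risk identification rather than a single number.
    return plan["quality"]

def pick_best(plans: list[dict]) -> dict:
    # Commit policy: go all-in on the single strongest plan,
    # even when the other explorers disagree.
    return max(plans, key=score)

def merge(plans: list[dict]) -> dict:
    # "Averaging" policy: take elements from every plan, which
    # dilutes a standout approach into a safe middle ground.
    return {"steps": [s for p in plans for s in p["steps"]]}
```

What I have observed so far looks more like `merge` than `pick_best` when the explorers fundamentally disagree, though with a week of use I would not bet much on that characterization.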
Sixth, there is no way to configure the explorer agents. You cannot tell the system to have one explorer focus on performance and another focus on backwards compatibility. The three explorers are, as far as I can tell, given the same prompt and left to diverge on their own. Giving users some control over the exploration axes would make the multi-agent architecture significantly more powerful. I expect this to come eventually, but it is not here yet.
Seventh, and this is minor, the browser UI does not have a dark mode toggle that I could find. For a tool made by a company that ships a terminal-based coding agent used primarily by developers who stare at screens all day, this is a surprising omission.
## What this means for how we build with agents
UltraPlan is one feature in one tool, but it points at a bigger shift in how agentic workflows are going to evolve.
The shift is from single-agent monoliths to multi-agent pipelines with specialized stages. We have been talking about this shift for a while. I wrote about it in the context of building multi-agent systems. But UltraPlan is notable because it brings the multi-agent pattern to a stage of the workflow that most tools treat as a single-agent task: planning.
Think about what Claude Code looked like six months ago. You typed a prompt, the agent planned, the agent executed. One agent, one flow, one session. The plan was an internal step that happened inside the agent's context window. You could enter plan mode to review it, but the planning was still fundamentally one agent thinking through the problem once.
UltraPlan splits the planning stage into a pipeline of its own. Three explorers. One critic. A review UI. An approval gate. Execution as a separate stage with its own options. The planning step, which used to be a single internal monologue, is now a multi-agent workflow with its own architecture.
That is interesting because it suggests that every stage of the agentic workflow is going to get this treatment eventually. Execution will get it. Code review will get it. Testing will get it. Each stage will evolve from a single agent doing one pass to multiple agents doing parallel passes with synthesis and human review.
When that happens, the developer's job shifts from writing code to reviewing plans and reviewing code. You become the person who approves the plan, approves the PR, and occasionally steps in when the agents get confused. That is a controversial statement, and I want to be careful about what I am and am not claiming. I am not claiming this is happening tomorrow. I am not claiming it works for all tasks. I am claiming that UltraPlan is a concrete step in this direction, and the architecture patterns it introduces are going to show up in more tools over the next year.
The other interesting signal is the cloud offloading. Running planning in the cloud rather than locally is a pragmatic choice, but it also establishes a precedent. If planning can run in the cloud, so can execution. If execution can run in the cloud, the local CLI becomes a thin client for kicking off and reviewing cloud-based agent sessions. That is a different model of development than what most of us do today, and it is worth thinking about even if you are not ready to adopt it.
I keep coming back to a principle I have written about before: the moat in agent systems is not the runtime. It is the workflow, the evaluation criteria, the domain knowledge, the feedback loops. UltraPlan reinforces this. Anthropic is investing in better runtime infrastructure (planning, multi-agent orchestration, cloud VMs, review UIs) so that developers can focus on the parts of their systems that actually differentiate. That is the right layer to build on. If someone else wants to handle the orchestration of planning agents, let them. Your value is in knowing what the plan should achieve, not in running the planner.
For teams building their own agentic tools, UltraPlan is worth studying as a reference architecture even if you never use it directly. The pattern of parallel exploration followed by critical synthesis followed by human review followed by gated execution is a good pattern. It applies to more than just code planning. It applies to research, analysis, content generation, and any creative task where exploring multiple approaches before committing is valuable.
The question I am sitting with is how far this goes. Does every stage of the development workflow eventually become a multi-agent cloud pipeline? Or does the overhead of coordination eat the benefits once you get past a certain complexity? I do not have an answer. I have a research preview and a week of experience. But the direction is clear, even if the destination is not.
## FAQ
### What version of Claude Code do I need for UltraPlan?
You need Claude Code v2.1.91 or later. You also need a Claude Code on the web subscription. The feature is not available if you access Claude through Amazon Bedrock, Google Vertex, or Azure Foundry.
### How do I start an UltraPlan session?
Three ways. Type /ultraplan as a slash command in Claude Code. Include the word "ultraplan" in your prompt. Or take an existing local plan and push it to the cloud for refinement. All three methods move the planning work to Anthropic's cloud infrastructure.
### Is UltraPlan free?
It is included with a Claude Code on the web subscription. There is no separate charge for UltraPlan itself, but you do need the subscription, which is a paid product. The specifics of what is included may depend on your plan tier; availability of the one-million-token context window, for example, varies by plan.
### How does UltraPlan handle my repository credentials?
Your GitHub credentials are proxied through a secure authentication layer. They never enter the sandbox VM where the planning agents run. All outbound traffic goes through an allowlist proxy with audit logging. The agent can read your repo and create PRs, but your tokens and SSH keys stay outside the sandbox.
### Can I use UltraPlan for tasks other than code planning?
UltraPlan is designed for code planning within Claude Code. The planning output is a structured plan for a coding task, and the execution options (creating a PR or teleporting to your local terminal) are both code-oriented. The underlying architecture, parallel exploration with critical synthesis, is a general pattern, but the UltraPlan product is specifically built for software development workflows.
### What happens if the planning agents need more information from me?
The plan moves to a "needs input" state and waits for your response. This is a synchronous step. The explorers or critic have hit an ambiguity in your task description and need clarification before continuing. You respond in the browser UI, and planning resumes. This is better than the alternative, which is the agent making assumptions and continuing with potentially wrong context.
### Is UltraPlan suitable for production use in enterprise environments?
It is a research preview as of April 2026. That means it is experimental, may change significantly, and should not be relied on for critical production workflows. The security architecture is reasonable, with isolated VMs, credential proxying, and audit logging. But enterprise teams in regulated industries need to evaluate whether sending repository contents to Anthropic-managed infrastructure fits their compliance requirements. The research preview status also means there are no SLAs or uptime guarantees.