Patterns
Pattern 06 of 26
Computer Use Agents (CUA)
If a human can click it, an agent can reach it
Most software in the world does not have an API. Computer use is the pattern that solves that. The model takes a screenshot, figures out what it is looking at, then outputs clicks and keystrokes like a human would. Anthropic launched it in October 2024. OpenAI followed in January 2025. It is slower and less reliable than API-based tool use, but for legacy systems and anything behind a GUI it is often the only option you have.
Why it matters
I think of this as the long tail of automation. The well-maintained SaaS tools have APIs. Everything else does not: the internal dashboards, the government portals, the decade-old desktop software that the operations team depends on. Computer use is the only way to reach those. You pay for it in latency and occasional breakage, but sometimes that trade is worth making.
Deep dive
Computer use works by giving the model vision and action. The model receives a screenshot of the current screen state, reasons about what it sees, and produces action outputs: click at these coordinates, type this text, scroll down. Then the loop repeats. Anthropic launched this with Claude 3.5 Sonnet in October 2024; OpenAI followed with their Computer-Using Agent in January 2025. Each step in the loop requires a full round-trip to the model, which is why latency compounds quickly on multi-step tasks. A five-step task might take a human thirty seconds in a browser. The same task via computer use might take three to five minutes.
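The observe-reason-act loop can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `take_screenshot`, `call_model`, and `execute_action` are stubs standing in for real screen capture, a model round-trip, and OS-level input control.

```python
# Minimal sketch of the computer-use loop. All three primitives are
# stubbed; in a real agent each call_model step is a full model round-trip,
# which is where the latency compounds.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def take_screenshot() -> bytes:
    """Stub: in practice, capture the current screen as a PNG."""
    return b"<png bytes>"

def call_model(screenshot: bytes, goal: str, history: list[Action]) -> Action:
    """Stub: pretend the model clicks once, types once, then signals done."""
    script = [Action("click", x=120, y=300),
              Action("type", text="hello"),
              Action("done")]
    return script[len(history)]

def execute_action(action: Action) -> None:
    """Stub: in practice, dispatch to mouse/keyboard control."""
    pass

def run_agent(goal: str, max_steps: int = 10) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):      # cap steps: each one costs a round-trip
        shot = take_screenshot()                   # 1. observe
        action = call_model(shot, goal, history)   # 2. reason
        if action.kind == "done":                  # model declares completion
            break
        execute_action(action)                     # 3. act, then loop
        history.append(action)
    return history

history = run_agent("Fill in the login form")
```

The `max_steps` cap matters in practice: because every iteration is a model round-trip, an unbounded loop on a stuck task burns both time and tokens.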
The coverage argument is the strongest one for this pattern. Most enterprise software, most internal tooling, most government-facing workflows, none of it has a public API. Browser Use, the leading open-source library for browser-based computer use with 58,000+ GitHub stars, exists entirely because this gap is real. It handles browser-specific interactions like navigation, form filling, and dynamic content rendering, without needing API access to the sites it controls. That is a meaningful capability unlock for anyone building automations in environments they do not control.
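To make the browser-action layer concrete, here is a toy dispatcher over an invented page model. The action names and the `FakeBrowser` class are illustrative only, showing the kind of navigation and form-filling primitives a library like Browser Use provides; this is not Browser Use's actual API.

```python
# Toy browser-action dispatcher. FakeBrowser and the action vocabulary
# are invented for this sketch; a real library drives an actual browser.
class FakeBrowser:
    def __init__(self):
        self.url = "about:blank"
        self.form: dict[str, str] = {}
        self.submitted = False

    def navigate(self, url: str) -> None:
        self.url = url
        self.form = {}          # fresh page, fresh form state
        self.submitted = False

    def fill(self, field: str, value: str) -> None:
        self.form[field] = value

    def submit(self) -> None:
        self.submitted = True

def dispatch(browser: FakeBrowser, action: dict) -> None:
    """Route a model-emitted action dict to the matching browser primitive."""
    kind = action["kind"]
    if kind == "navigate":
        browser.navigate(action["url"])
    elif kind == "fill":
        browser.fill(action["field"], action["value"])
    elif kind == "submit":
        browser.submit()
    else:
        raise ValueError(f"unknown action: {kind}")

b = FakeBrowser()
for act in [{"kind": "navigate", "url": "https://portal.example/login"},
            {"kind": "fill", "field": "username", "value": "ops"},
            {"kind": "submit"}]:
    dispatch(b, act)
```

The point of the indirection is that the model only ever emits a small, fixed action vocabulary; the library owns the messy part of binding those actions to a live page.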
OSWorld, the NeurIPS 2024 benchmark for computer use agents in real environments, put early task completion rates at around 10-15%. That sounds discouraging. I read it differently. The benchmark covers open-ended tasks in unpredictable environments. For narrow, well-defined tasks on consistent interfaces, accuracy is substantially higher. The pattern breaks down when the UI is dynamic, when element positions shift between sessions, or when the task requires judgment calls that a screenshot alone does not provide enough context for. The skill is scoping your use case to the reliable quadrant.
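One common mitigation for shifting UIs is to verify after every action rather than trust it: re-observe the screen, check for the expected state, and retry a bounded number of times. A hedged sketch, with illustrative function names (the flaky-click demo below is invented for the example):

```python
# Verify-then-retry wrapper: perform an action, confirm the expected
# post-state from a fresh observation, retry up to a bound.
def act_with_verification(do_action, check_state, retries: int = 3) -> bool:
    """Returns True once check_state confirms the expected screen state,
    False if the retry budget runs out."""
    for _ in range(retries):
        do_action()
        if check_state():
            return True
    return False

# Toy demo: the "click" only lands on the second attempt.
attempts = {"n": 0}

def flaky_click():
    attempts["n"] += 1

def landed():
    return attempts["n"] >= 2

ok = act_with_verification(flaky_click, landed)
```

Bounded retries trade extra round-trips for reliability, which is usually the right trade inside the narrow, well-defined tasks where this pattern already works best.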