Harness Engineering · Case Study

The Dark Factory

Inside OpenAI's Symphony experiment. Three engineers. Five months. One million lines of production code. Zero human authorship. Zero pre-merge review.

1M+ Lines of Code
1B Tokens / Day
<60s Build Loop
The Old Model
Humans write. Humans review. Humans merge.
Code generation is the bottleneck. Every PR needs human approval. Agents are "assistants" that propose diffs for humans to accept or reject.
Sequential · Review-Gated
What Is A Dark Factory
Agents write, review, and merge. Humans run capability analysis.
Ryan Lopopolo's team at OpenAI Frontier ran a five-month experiment. Three engineers, one empty repo, one rule: nobody writes code by hand. They built an Elixir orchestrator called Symphony and shipped 1M+ lines of production code across 1,500+ PRs, all autonomously merged. Humans did not review PRs. They watched patterns and fixed the harness.
Symphony · Elixir Orchestrator · Post-Merge Review
Mental Model
A factory with the lights off
Manufacturing's "dark factory" runs without humans on the floor. Machines do the work. Humans design the process. The software version: agents own the keyboard, humans own the specification.
Specs > Source
60-Second Build
Build speed is clock speed
Rule: builds complete in under 60 seconds, always. Every idle minute across 20 parallel agents is a minute of burned tokens.
Make · Bazel · Turbo · NX
500+ NPM packages
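The 60-second rule can be enforced mechanically. Below is a minimal sketch of a CI guard that runs the build and fails if it blows the budget; the function name and budget constant are illustrative, not part of Symphony.

```python
import subprocess
import sys
import time

BUILD_BUDGET_SECONDS = 60  # the dark-factory rule: every build finishes in under a minute


def timed_build(cmd: list[str], budget: float = BUILD_BUDGET_SECONDS) -> float:
    """Run a build command and fail loudly if it exceeds the time budget."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)  # raises if the build itself fails
    elapsed = time.monotonic() - start
    if elapsed > budget:
        raise RuntimeError(f"build took {elapsed:.1f}s, budget is {budget}s")
    return elapsed
```

Wiring this into CI turns build latency from a vibe into a hard failure, the same way a broken test would be.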
Post-Merge Review
No human PR gate
Code merges if CI passes. Humans sample output after the fact, looking for patterns, not individual defects. An agent-reviewing-agent pipeline runs P0/P1/P2 triage.
P0 Block · P1 Flag · P2 Inform
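The triage policy can be sketched in a few lines. This is a hypothetical model of the P0/P1/P2 scheme described above, not Symphony's actual code; the type names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = "block"   # merge is blocked until the finding is fixed
    P1 = "flag"    # merge proceeds; an issue is filed for follow-up
    P2 = "inform"  # merge proceeds; noted in the review log


@dataclass
class Finding:
    severity: Severity
    message: str


def merge_allowed(findings: list[Finding]) -> bool:
    """Post-merge-review policy: only P0 findings stop the line."""
    return all(f.severity is not Severity.P0 for f in findings)
```

The key property is asymmetry: only the highest severity gates the pipeline, so most reviewer-agent findings never block throughput.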
Economics
Token billionaire math
~1 billion tokens consumed per day. The daily spend is real, but still cheaper than the three to seven engineers the same output would otherwise require.
$2-3K / day · $60-90K / mo · 3-7 engineers
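A quick back-of-the-envelope check confirms the figures in this section are internally consistent:

```python
# Arithmetic check on the numbers quoted above (all figures from the section).
tokens_per_day = 1_000_000_000                     # ~1B tokens/day
daily_spend_low, daily_spend_high = 2_000, 3_000   # $2-3K/day

# Implied blended price per million tokens:
per_million_low = daily_spend_low / (tokens_per_day / 1_000_000)
per_million_high = daily_spend_high / (tokens_per_day / 1_000_000)
print(per_million_low, per_million_high)  # 2.0 3.0 (dollars per 1M tokens)

# Monthly spend, assuming 30 days:
print(daily_spend_low * 30, daily_spend_high * 30)  # 60000 90000
```

$2-3 per million tokens blended, and $60-90K a month, which is indeed in the salary range of a handful of senior engineers.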
PR Velocity
The model upgrade compounds
Humans did not work faster. The models did. Each generation turned previously hard tasks into routine ones.
Pre-5.2: 3.5 PRs/eng/day · Post-5.2: 5-10 PRs/eng/day
Symphony Stack · Elixir Orchestrator
Six layers that turn 20 parallel agents into a factory
L6 Observability: Prometheus, Jaeger, Grafana. Traces, logs, metrics, dashboards.
L5 Integration: GitHub PRs, Linear issues, Slack, observability APIs.
L4 Execution: Task runners, skills, CLI invocations. Where agents actually work.
L3 Coordination: Elixir process supervision. Lifecycle, restart, isolation.
L2 Configuration: Environment setup, tool exposure, blast-radius control.
L1 Policy: Hard guardrails. CI must pass. Security rules are non-negotiable.
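The L1 policy layer is the simplest to picture: a set of hard guardrails every merge must clear, with no override path. A minimal sketch, assuming a PR is a plain dict; the field names and guardrail list are illustrative, not Symphony's actual schema.

```python
from typing import Callable

# Each policy is a predicate over a PR; all of them must hold for a merge.
Policy = Callable[[dict], bool]

HARD_GUARDRAILS: list[Policy] = [
    lambda pr: pr["ci_status"] == "green",  # CI must pass, always
    lambda pr: not pr["touches_secrets"],   # security rules are non-negotiable
]


def may_merge(pr: dict) -> bool:
    """Merge only when every hard guardrail passes; there is no override path."""
    return all(rule(pr) for rule in HARD_GUARDRAILS)
```

Keeping policy as the bottom layer matters: everything above it (coordination, execution, integration) can fail or be rewritten, but the guardrails do not move.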
The Role Shift
From code review to capability analysis
Before → After
Review every PR line by line → Sample outputs for patterns
"Is this code correct?" → "Why did the agent fail here?"
Gate merges sequentially → Merge on green CI, observe
Fix bugs in the PR → Fix capability gaps in the harness
Bottleneck: human attention → Bottleneck: specification quality
The new question: not "did the agent write the right code" but "does the harness give the agent everything it needs to write the right code."
Ghost Libraries
Distribute specs, not source code
Ship a specification. The agent reads it and reproduces the library locally, tailored to your codebase. No version conflicts. No shared source. No supply chain attacks.
Speculative · Not hypothetical
Still Human Territory
What agents cannot do yet
Agents follow patterns. They do not invent them.
The Core Insight
The only fundamentally scarce thing is synchronous human attention. Models are trivially parallelizable.
— Ryan Lopopolo, OpenAI Frontier Product Exploration
Translation: if you can run 20 agents in parallel and each produces working code, the bottleneck is not authorship. It is the quality of the specifications and constraints they receive. That is harness engineering.
What To Adopt Today · Any Team Size
Five dark-factory practices that work at solo-dev scale too
Practice → Why It Matters → Smallest Version
1. Measure the build loop → Build time is the clock speed of every agent you run. → Fix anything over 2 minutes.
2. Encode taste as text → Agents consume CLAUDE.md, specs, and quality scores as context. → Turn tribal knowledge into markdown.
3. Consider post-merge for low-risk → Not every PR needs a human gate. Observability replaces review. → Auto-merge on green CI for docs and tests.
4. Treat code as disposable → If the spec is good, regenerating is cheaper than defending. → Throw away, do not merge-conflict-resolve.
5. Invest in agent observability → You cannot fix capability gaps you cannot see. → Log every agent action with structured traces.
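The smallest version of practice 5 is a one-function logger. A sketch of structured tracing for agent actions, assuming one JSON object per line; the field names are illustrative:

```python
import json
import time


def log_agent_action(agent_id: str, action: str, **fields) -> str:
    """Emit one JSON object per agent action: a minimal structured trace line."""
    record = {"ts": round(time.time(), 3), "agent": agent_id, "action": action, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)  # in production this would feed a log pipeline, not stdout
    return line


# Example: an agent opening a PR, with arbitrary extra fields attached.
log_agent_action("agent-07", "open_pr", repo="symphony", pr=1501)
```

Because every line is machine-parseable, "why did the agent fail here?" becomes a query over traces rather than an archaeology exercise.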
Source: Latent Space / Ryan Lopopolo · OpenAI Frontier · sangampandey.info Dark Factory