GLM-5.1: Open Weights Just Hit Frontier Coding

GLM-5.1 tops SWE-Bench Pro with open weights, 744B MoE architecture, and 8-hour autonomous sessions. What this means for agentic engineering.

ai · open-source · agentic-engineering · benchmarks · glm-5

I have been saying for two years that the gap between open-weight models and closed APIs would close. I was wrong about the timeline. I thought it would take until late 2027. GLM-5.1 closed that gap last week.

Z.ai, formerly Zhipu AI, released an open-weight model that tops SWE-Bench Pro. Not by a tiny margin on some obscure benchmark nobody cares about. On the benchmark that actually measures whether a model can fix real GitHub issues, GLM-5.1 scored 58.4. That puts it ahead of GPT-5.4 at 57.7 and Claude Opus 4.6 at 57.3. And the weights are sitting on Hugging Face right now under an MIT license.

What GLM-5.1 Actually Is#

Let me start with the architecture, because the numbers here are genuinely interesting.

GLM-5.1 is a 744 billion parameter Mixture-of-Experts model. It has 256 experts, of which 8 are active at any given time. That means you are running roughly 40 billion active parameters per forward pass. For context, that is about the same compute footprint as a dense 40B model, but with the knowledge capacity of something much larger. MoE is not new, but executing it well at this scale, with this level of routing stability, is not trivial.
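
To make the routing arithmetic concrete, here is a toy sketch of top-k expert selection — illustrative only, not Z.ai's actual router. With 8 of 256 experts active, roughly 1/32 of the expert parameters (plus the shared layers) participate in each forward pass, which is where the ~40B active figure comes from:

```python
import math

def top_k_routing(logits: list[float], k: int = 8) -> list[tuple[int, float]]:
    """Select the top-k experts by router logit and softmax-normalize
    their gate weights, as in a standard top-k MoE router."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    hi = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - hi) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 256 toy router scores; only 8 experts receive this token.
logits = [((i * 37) % 101) / 10.0 for i in range(256)]
routing = top_k_routing(logits)
```

The routing decision happens per token, so different tokens in the same sequence can hit different experts; the 40B figure is an average compute footprint, not a fixed subnetwork.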

The context window is 200,000 tokens. Output can go up to 128,000 tokens. That output length is worth pausing on. Most models cap at 8K or 16K output tokens. Having 128K means the model can generate entire files, complete implementations, long-form analysis, without hitting the output ceiling that forces you to chunk your requests.

This is a text-only model. No image input, no audio, no video. Z.ai made a deliberate choice to focus entirely on language and code. Whether that is a limitation or a strength depends on your use case, but for coding agents, it is the right trade-off. You do not need multimodal capabilities to fix a GitHub issue or scaffold a service.

The model ships under an MIT license. You can download the weights, fine-tune them, deploy them commercially, build products on top of them. No usage restrictions, no phone-home requirements, no license that changes next quarter. This matters more than most people think. When you are building production infrastructure, license stability is not a nice-to-have. It is a requirement.

The team behind it#

Z.ai started as Zhipu AI, a Beijing-based company that spun out of Tsinghua University research. They have been building the GLM series for several years now. GLM-4 was solid but unremarkable. GLM-5 made noise in China but did not get much attention internationally. GLM-5.1 is the release where the rest of the world has to pay attention.

The team has been relatively quiet about their process compared to the blog-post-heavy culture at Anthropic or the product-launch spectacle at OpenAI. What they have done instead is ship a model that speaks for itself. The paper is dense, the benchmarks are comprehensive, and the model is available for anyone to test.

The Benchmark Story#

Let me walk through the numbers, because the full picture is more nuanced than "GLM-5.1 is number one."

Coding benchmarks#

SWE-Bench Pro is the headliner. GLM-5.1 scored 58.4, which is the current state of the art. For comparison: GPT-5.4 sits at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2. The margin over GPT-5.4 is less than a point, so this is not a blowout. But being on top, with open weights, is the story.

SWE-Bench Verified, which is the curated subset with confirmed reproducible issues, shows GLM-5.1 at 77.8%. That is a strong result. It means the model is not just solving edge cases that happen to match its training data. It is handling verified, real-world bug fixes at a high rate.

Terminal-Bench 2.0, which tests command-line and systems-level tasks, shows a score of 63.5. NL2Repo, which measures the ability to generate entire repositories from natural language descriptions, comes in at 42.7. These are less flashy numbers but they matter for agentic engineering workflows where the model needs to do more than edit a single file.

Reasoning benchmarks#

AIME 2026 (math competition problems): 95.3. That is near-ceiling performance on competition math. GPQA-Diamond (graduate-level science questions): 86.2. Respectable, though Claude Opus 4.6 leads this category at 91.3. HLE (Humanity's Last Exam): 31.0 text-only, and 52.3 when the model has access to tools. The tool-augmented number is the one that matters for how most people will actually use this model.

Agentic benchmarks#

This is where GLM-5.1 starts to separate itself from the narrative of "just another good model."

CyberGym: 68.7. This tests the model's ability to complete cybersecurity challenges autonomously. BrowseComp: 68.0 base, climbing to 79.3 with context management enabled. MCP-Atlas, which tests tool use across the Model Context Protocol ecosystem: 71.8. Tool-Decathlon, a broad test of tool calling across ten different categories: 40.7.

These agentic numbers tell a story about what Z.ai was optimizing for. This is not a model that was tuned to look good on static question-answer benchmarks. It was designed to work inside agent loops, calling tools, managing state, and persisting through long task sequences.

The benchmark caveat#

I need to be direct about something. These SWE-Bench Pro scores are self-reported. Z.ai ran the evaluation themselves. The benchmarks have not been independently verified by a neutral third party as of this writing. That does not mean the numbers are wrong. It means you should treat them with the same skepticism you would apply to any vendor-reported benchmark. Independent reproduction will take a few weeks. By the time you read this, the community may have already confirmed or challenged these results.

The Agentic Endurance Angle#

Here is where GLM-5.1 does something that no other model, open or closed, has demonstrated at this level.

Z.ai designed GLM-5.1 for what they call "long-horizon autonomous operation." The model can sustain 8-hour autonomous coding sessions. Not 8 hours of chat. Eight hours of continuous plan-execute-analyze-optimize loops where the model is making tool calls, reading results, adjusting its approach, and continuing without human intervention.

The showcase example they published is a vector database optimization task. The model ran over 600 iterations across more than 6,000 tool calls. It achieved a 6x performance improvement on the target database. To be clear about what that means: the model identified bottlenecks, hypothesized solutions, implemented them, measured the results, and iterated. Six hundred times. With minimal context degradation.
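
The loop structure described here — plan, execute, analyze, optimize — can be sketched in miniature. Everything below is a toy stand-in: the cache-size knob, the synthetic throughput curve, and the tool-call accounting are illustrative, not anything from Z.ai's published task:

```python
# Toy stand-ins: one tunable knob and a synthetic throughput curve
# that peaks at cache_mb == 512.
state = {"cache_mb": 64}

def measure() -> float:
    return -abs(state["cache_mb"] - 512)  # higher is better

def apply_change(delta: int) -> None:
    state["cache_mb"] += delta

def optimize_loop(max_iters: int = 600) -> tuple[float, int]:
    """Plan-execute-analyze-optimize: apply a change, measure, keep
    it if the metric improved, otherwise revert. Every measurement
    and every change counts as one 'tool call'."""
    tool_calls = 1
    best = measure()
    for _ in range(max_iters):
        apply_change(32)
        tool_calls += 1
        score = measure()
        tool_calls += 1
        if score > best:
            best = score
        else:
            apply_change(-32)  # revert the unhelpful change
            tool_calls += 1
    return best, tool_calls

best, calls = optimize_loop()
```

Even this trivial version racks up well over a thousand tool calls across 600 iterations, which gives a feel for why a 6,000-call session is plausible bookkeeping rather than an exotic number.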

Another demonstration involved building a Linux desktop environment from scratch, entirely through browser-based interactions. The model handled package management, configuration, window manager setup, application installation. The kind of multi-step systems task that would normally require a human to babysit the agent through dozens of failure modes.

This is the real differentiator and it connects directly to what I have been writing about with long-running agent harnesses. The bottleneck for production agents has never been "can it solve a single problem." The bottleneck is endurance. Can the model maintain coherent state across thousands of tool calls? Can it recover from errors without losing track of the larger goal? Can it operate for hours without the quality of its reasoning degrading into repetitive loops?

Most models start to degrade significantly after a few hundred tool calls. The context window fills up, the model loses track of what it has already tried, and you end up with an agent that is technically running but practically spinning its wheels. GLM-5.1's architecture appears to handle this differently, though the mechanism is not fully documented in their paper. My guess is that the 200K context window combined with aggressive internal summarization is doing the heavy lifting, but Z.ai has not confirmed this.
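
To show what a summarization strategy of that kind might look like — this is my speculation, not Z.ai's documented mechanism — here is a minimal compaction sketch that folds the oldest transcript entries into a summary line once a token budget is exceeded:

```python
def compact(history: list[str], budget: int) -> list[str]:
    """Naive context management: when the transcript exceeds the
    token budget, fold the two oldest entries into one short summary
    line and keep the recent tail verbatim. Token count is crudely
    approximated by word count."""
    def tokens(msgs: list[str]) -> int:
        return sum(len(m.split()) for m in msgs)
    while tokens(history) > budget and len(history) > 2:
        merged = (history[0] + " " + history[1]).split()
        history = ["SUMMARY: " + " ".join(merged[:10])] + history[2:]
    return history

log = [f"tool call {i} returned eight words of output here" for i in range(20)]
compacted = compact(log, budget=50)
```

A production version would summarize with the model itself rather than truncating, but the shape is the same: recent turns stay verbatim, old turns get compressed, and the budget never overflows.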

What endurance means in practice#

If you are building agent scaffolding for production use, the endurance question changes what kind of work you can delegate. A model that degrades after 200 tool calls is useful for focused tasks: fix this bug, write this test, refactor this function. A model that maintains coherence across 6,000 tool calls is useful for projects: migrate this service to a new framework, optimize this entire data pipeline, build this feature end to end.

That is a qualitative shift. It is the difference between using an AI as an assistant and using it as an engineer who happens to work 24 hours a day.

The Huawei Training Story#

This is the part that will dominate the geopolitical analysis but that I think matters more for practical engineering reasons than political ones.

GLM-5.1 was trained entirely on Huawei Ascend 910B chips. No Nvidia hardware was used at any point during training. This makes it the first frontier-class model trained exclusively on domestic Chinese silicon.

The significance here operates on two levels.

On the industry level, this breaks the assumption that you need Nvidia H100s or B200s to train a competitive frontier model. The Ascend 910B is not as fast per chip as Nvidia's top hardware. But Z.ai compensated with what appears to be excellent distributed training infrastructure and a MoE architecture that is inherently more parallelizable than dense transformers. The result speaks for itself. Whatever efficiency they lost on a per-chip basis, they made up in system-level engineering.

On the practical level, this means GLM-5.1 is less exposed to supply chain disruptions than models that depend on Nvidia. If you are an enterprise team evaluating model providers, supply chain resilience is a real factor. Anthropic, OpenAI, and Google all depend on the same Nvidia hardware pipeline. Z.ai has an independent path. That diversification has value.

The export control angle is obvious but worth stating. US export restrictions on advanced chips to China were supposed to slow down Chinese AI development. GLM-5.1 is evidence that the restrictions motivated domestic hardware development rather than preventing frontier capability. Whether that changes the policy conversation is above my pay grade. What I can tell you is that the engineering achievement is real.

What this does not mean#

I want to be careful here. "Trained on Ascend 910B" does not mean "as efficient as Nvidia." Z.ai likely used significantly more chips, more time, and more energy to achieve the same result. The economics of training on domestic Chinese hardware are probably worse than training on Nvidia. But the capability ceiling is clearly not lower. And economics improve with iteration, while capability ceilings do not move as easily.

Pricing and the Economics#

The API pricing is where GLM-5.1 becomes impossible to ignore.

Input tokens: $1.00 per million. Output tokens: $3.20 per million.

Let me put that in context with the models it is competing against on benchmarks.

Gemini 3.1 Pro charges $3.75 per million input tokens and $15 per million output. That makes GLM-5.1 roughly 3.75x cheaper on input and 4.7x cheaper on output.

GPT-5.4 charges $4.70 per million input and $15.00 per million output. GLM-5.1 is roughly 4.7x cheaper on both input and output.

Claude Opus 4.6 charges $15 per million input and $75 per million output. GLM-5.1 is 15x cheaper on input and over 23x cheaper on output.

These are not small differences. For teams running agents that make thousands of tool calls per session, the cost gap between GLM-5.1 and Claude Opus 4.6 is the difference between "we can afford to let the agent run" and "we need to add hard cost caps that compromise quality."
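
The arithmetic is easy to check. A sketch using the published prices — the per-call token volumes below are purely illustrative, not measured from any real session:

```python
# Published per-million-token prices (USD): (input, output).
PRICES = {
    "glm-5.1":         (1.00, 3.20),
    "claude-opus-4.6": (15.00, 75.00),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# Hypothetical 6,000-tool-call session at ~2K input / ~500 output
# tokens per call (illustrative volumes only).
inp, out = 6_000 * 2_000, 6_000 * 500
glm = session_cost("glm-5.1", inp, out)
opus = session_cost("claude-opus-4.6", inp, out)
```

Under those assumptions the same session costs about $21.60 on GLM-5.1 versus about $405 on Claude Opus 4.6. Run a few of those per day and the gap stops being a rounding error.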

If you have been reading my posts about prompt caching architecture, you know that cost optimization at the token level is one of the highest-leverage engineering decisions for agentic systems. GLM-5.1's pricing makes some of those optimizations unnecessary. When your base price is this low, the marginal cost of an extra few thousand tokens of context is negligible.

The self-hosting option#

Because the weights are open, you can eliminate API costs entirely by running GLM-5.1 on your own infrastructure. The 744B total parameter count means you need substantial hardware just to hold the weights, but the 40B active parameter footprint makes the compute side of inference more tractable than a dense 744B model. Teams with existing large GPU clusters may be able to run it without new procurement, depending on memory capacity and interconnect.

Z.ai also offers a "Coding Plan" starting at $3 per month. The details on what that includes are still sparse, but the positioning is clear: they want individual developers using the model, not just enterprises with API budgets.

The cost caveat#

Low pricing at launch does not guarantee low pricing forever. Z.ai is a venture-funded company that needs to build a business. These prices could be introductory, designed to capture market share before increasing. If you are making architectural decisions based on GLM-5.1's pricing, build in the ability to switch models. Which, given that the weights are open, you can do by self-hosting.

Where It Falls Short#

I would not be doing my job if I only covered the highlights. GLM-5.1 has real limitations.

No multimodal support#

This is text-only. No image understanding, no diagram parsing, no screenshot analysis. If your agent workflow involves looking at UI screenshots to verify changes, or parsing diagrams from documentation, GLM-5.1 cannot do it. For pure code and text tasks, this does not matter. For full-stack agentic workflows that interact with visual interfaces, it is a hard constraint.

Science reasoning trails Claude#

GPQA-Diamond at 86.2 versus Claude Opus 4.6 at 91.3. That is a meaningful gap on graduate-level science questions. If your use case involves scientific reasoning, medical analysis, or complex domain-specific Q&A, Claude still has the edge. For coding and software engineering tasks, this gap is less relevant. But it suggests that GLM-5.1's training optimization leaned heavily toward code at the expense of some breadth.

Kernel optimization performance#

In benchmarks testing low-level kernel and systems optimization, GLM-5.1 achieved a 3.6x improvement factor compared to Claude's 4.2x. This is a specialized metric, but it matters if you are working on performance-critical systems code. The model is good at systems work but not the best.

Chinese-first documentation#

Z.ai's documentation, blog posts, and technical papers are primarily in Chinese with English translations that are sometimes rough. The model itself works fine in English. But if you need to debug an issue with the model's behavior or understand a nuance of its training, you may find yourself running documentation through a translator. This is a friction cost that adds up.

Benchmark skepticism#

I mentioned this already but it bears repeating. SWE-Bench Pro scores are self-reported. The agentic endurance claims, while demonstrated in published videos, have not been independently verified under controlled conditions. The 6,000 tool call claim is impressive. It is also a claim. I want to see independent teams reproduce these results before I build production infrastructure around them.

What This Means for Production Agent Stacks#

Let me shift from analysis to practice. If you are building agentic systems today, how should you think about GLM-5.1?

Framework compatibility#

GLM-5.1 is already supported by vLLM and SGLang for self-hosted inference. Z.ai's API is OpenAI-compatible, which means it works with most existing agent frameworks out of the box. They have also announced compatibility with Claude Code (as a model backend) and OpenClaw, the open-source agent framework that has been gaining traction.

If you are running a multi-agent system, GLM-5.1 slots in as a worker model without significant integration work. The OpenAI-compatible API means your existing orchestration code, tool definitions, and prompt templates carry over.

The context degradation question#

The real question I have, and that I have not seen answered satisfactorily, is about context degradation at scale. Z.ai claims coherence across 6,000+ tool calls. But what does "coherence" mean precisely? Is the model making novel decisions at tool call 5,500, or is it following established patterns from the first 500 calls? Is it still capable of course-correcting when it encounters unexpected results late in a session?

These are the questions that matter for production. A model that maintains state but loses adaptability is useful for repetitive tasks but dangerous for open-ended ones. I plan to run my own endurance tests once I have access to a self-hosted instance, and I will publish the results.
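
One way to make "coherence" measurable is to treat repeated action signatures late in a session as a warning sign. A crude probe, run against synthetic logs here for illustration:

```python
def novelty_rate(actions: list[str], window: int = 500) -> float:
    """Fraction of the final `window` actions that never appeared
    earlier in the session: a rough proxy for whether the agent is
    still making new decisions late in a run, or just replaying
    patterns it established early on."""
    if len(actions) <= window:
        return 1.0
    seen = set(actions[:-window])
    return sum(a not in seen for a in actions[-window:]) / window

# Synthetic logs: one agent stuck in a loop, one still exploring.
stuck = [f"retry:{i % 10}" for i in range(6_000)]
exploring = [f"step:{i}" for i in range(6_000)]
```

A real harness would hash normalized tool calls rather than raw strings, but even this version distinguishes an agent that is spinning its wheels from one that is still adapting.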

The build-vs-buy equation#

GLM-5.1 changes the economics of the build-vs-buy decision for enterprise teams. Previously, frontier model access meant paying Anthropic, OpenAI, or Google per token, forever. The combination of open weights and competitive performance means you can now run a frontier-class coding model on your own infrastructure.

The cost of self-hosting a 744B MoE model is not zero. You need the hardware, the MLOps expertise, and the willingness to manage inference infrastructure. But for teams that are already spending six figures monthly on API calls for agent workloads, the payback period on self-hosting could be measured in months, not years.

This does not mean everyone should self-host. For most teams, the API at $1.00/$3.20 per million tokens is the right choice. But having the option changes the negotiating dynamic with every other model provider. When your fallback is "we will just run the open model ourselves," pricing conversations go differently.

Model routing gets more interesting#

The practical takeaway for most teams is not "switch everything to GLM-5.1." It is "add GLM-5.1 to your model router." If you are already doing intelligent routing between models based on task type, cost sensitivity, and latency requirements, GLM-5.1 becomes the obvious choice for long-running coding tasks where cost and endurance matter more than breadth.

Route your coding agents through GLM-5.1. Route your science and multimodal tasks through Claude. Route your high-throughput simple tasks through whatever is cheapest. The future of production AI is not one model. It is the right model for each task.
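
That routing policy can start as nothing more than a lookup table. A sketch — the categories mirror the paragraph above, and the model identifiers are illustrative:

```python
# Illustrative routing table, not a recommendation of exact model ids.
ROUTES = {
    "coding":     "glm-5.1",          # cheap, long-horizon endurance
    "science":    "claude-opus-4.6",  # leads GPQA-Diamond
    "multimodal": "claude-opus-4.6",  # GLM-5.1 is text-only
    "bulk":       "cheapest-small",   # high-throughput simple tasks
}

def pick_model(task_type: str, default: str = "glm-5.1") -> str:
    """Route a task to a model by category, falling back to the
    default for anything unclassified."""
    return ROUTES.get(task_type, default)
```

Production routers layer in cost caps, latency budgets, and fallbacks on provider outages, but the core decision is exactly this: classify the task, then dispatch.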

What I am watching for#

Three things will determine whether GLM-5.1 is a permanent fixture in production stacks or a benchmark curiosity.

First, independent benchmark verification. The community needs to reproduce these SWE-Bench Pro results. If they hold up, this is real. If there is a significant gap between self-reported and independent scores, the story changes.

Second, production reliability at scale. Benchmarks test capability. Production tests reliability. How does the model behave when you run it for thousands of hours across diverse codebases? What are the failure modes? Where does it get stuck? These answers only come from real-world deployment.

Third, Z.ai's business trajectory. An MIT-licensed model is only as stable as the company behind it. If Z.ai builds a sustainable business, the model ecosystem around GLM-5.1 will grow. If they struggle financially, development could stall. The weights are out there and cannot be taken back, but active development matters for long-term viability.

FAQ#

What is GLM-5.1?#

GLM-5.1 is a 744 billion parameter Mixture-of-Experts language model developed by Z.ai (formerly Zhipu AI). It uses 256 experts with 8 active at inference time, resulting in approximately 40 billion active parameters per forward pass. It supports a 200,000 token context window and up to 128,000 output tokens. The model is released under an MIT license with fully open weights.

Is GLM-5.1 open source?#

Yes. GLM-5.1 is released under the MIT license, which is one of the most permissive open-source licenses available. You can download the weights from Hugging Face, fine-tune them, deploy them commercially, and build derivative products without restrictions. The model weights, tokenizer, and configuration files are all publicly available.

How does GLM-5.1 compare to Claude Opus 4.6?#

On SWE-Bench Pro (software engineering), GLM-5.1 scores 58.4 versus Claude Opus 4.6 at 57.3. On GPQA-Diamond (science reasoning), Claude leads with 91.3 versus 86.2. Claude also edges ahead on kernel optimization tasks (4.2x vs 3.6x). GLM-5.1 is significantly cheaper at $1.00/$3.20 per million tokens versus Claude's $15/$75. Claude supports multimodal input while GLM-5.1 is text-only. For pure coding and agentic tasks, GLM-5.1 is competitive and cheaper. For breadth and multimodal capabilities, Claude remains stronger.

What hardware was GLM-5.1 trained on?#

GLM-5.1 was trained entirely on Huawei Ascend 910B chips, making it the first frontier-class model trained without any Nvidia hardware. This is significant both as a technical achievement and as evidence that frontier AI training is no longer dependent on a single hardware vendor.

Can I self-host GLM-5.1?#

Yes. The weights are available on Hugging Face and the model is supported by vLLM and SGLang for self-hosted inference. The 744B total parameter count requires substantial hardware, but the MoE architecture means only about 40B parameters are active during inference, making it more tractable than a dense model of similar total size. Self-hosting eliminates per-token API costs entirely.

How much does GLM-5.1 cost?#

Through Z.ai's API, GLM-5.1 costs $1.00 per million input tokens and $3.20 per million output tokens. This makes it 3.75x cheaper than Gemini 3.1 Pro, 4.7x cheaper than GPT-5.4 on input, and dramatically cheaper than Claude Opus 4.6. Z.ai also offers a Coding Plan starting at $3 per month for individual developers. Self-hosting with your own hardware eliminates API costs entirely.

What are GLM-5.1's limitations?#

The main limitations are: text-only input (no images, audio, or video), weaker science reasoning compared to Claude Opus 4.6 (86.2 vs 91.3 on GPQA-Diamond), Chinese-first documentation that can make troubleshooting harder for English-speaking developers, and self-reported benchmark scores that have not yet been independently verified. The model also trails Claude on kernel-level optimization tasks. For teams that need multimodal capabilities or the strongest possible science reasoning, GLM-5.1 is not the right choice.
