The arXiv paper describing GLM‑4.5 arrived on 2025‑08‑08 with an unusually broad claim: a single open Mixture‑of‑Experts (MoE) model family that competes across agentic tasks, mathematical and scientific reasoning, and real‑world coding. The submission lists affiliations with Zhipu’s research group and Tsinghua collaborators and presents two open‑weight checkpoints, GLM‑4.5 and the smaller GLM‑4.5‑Air, positioned as generalist models rather than specialist variants.
In parallel with the paper, the launch narrative on the Z.ai blog underscores three core assertions: competitive agentic browsing, high math and science scores under a deliberate “thinking” mode, and strong coding‑agent performance when paired with tool frameworks. The post also discloses the evaluation setup used for SWE‑bench Verified and Terminal‑Bench, details that matter because agent harnesses can materially affect scores.
The code and collateral live in the project’s GitHub repository, which provides inference scaffolding, quick‑start notebooks, and a readable architecture summary, while the open weights are mirrored on the team’s Hugging Face model page. One point worth flagging early is licensing: the repository advertises Apache‑2.0 while the readme text says the models are MIT‑licensed and the Hugging Face card also lists MIT, a discrepancy that readers should resolve per their compliance needs before redistribution or fine‑tuning. A reasonable interpretation is that code and weights may be under different terms, but the documents themselves are not fully explicit.
GLM‑4.5’s evaluation set prominently features agent benchmarks that simulate long‑horizon tool use. The paper reports results on τ‑bench, introduced at the τ‑bench arXiv page; on multi‑turn function calling, tracked by Berkeley’s BFCL v3 blog; and on web navigation, via OpenAI’s BrowseComp page, framing performance against proprietary baselines such as o4‑mini, documented in OpenAI’s o3 and o4‑mini announcement. Benchmark choice and framing matter here because agentic scores are contingent on simulator prompts, tool adapters, and retry budgets; the paper does describe its user simulator for τ‑bench and its caps for SWE‑bench runs, which improves interpretability relative to bare score tables.
Coding evaluations draw on live, repository‑level tasks and terminal control. For evolving competition problems, the authors cite LiveCodeBench, whose public interface is linked at the LiveCodeBench site; for end‑to‑end command‑line tasks, they measure on Terminal‑Bench from Stanford and Laude, linked at the Terminal‑Bench site. The latter is an agent harness as much as a dataset, so the model’s measured score is inseparable from the agent stack, the retry policy, and the action vocabulary used during the run.
General‑ability and reasoning tests include a harder reformulation of MMLU at the MMLU‑Pro arXiv page and research‑coding workloads at the SciCode benchmark site, along with expert‑written science questions from GPQA, hosted at the Hugging Face GPQA dataset page, and frontier‑difficulty academic questions from the Center for AI Safety at the Humanity’s Last Exam site. Across these, the paper often averages multiple stochastic samples (for example, Avg@32 on AIME and Avg@8 on GPQA) to stabilize estimates, an explicit attempt to control the variance that otherwise plagues single‑sample reporting on hard sets.
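To make the variance point concrete, here is a minimal sketch of Avg@k scoring: run k stochastic samples per question and average per‑sample correctness before averaging over questions. The function and data layout are illustrative, not the paper’s evaluation code.

```python
import statistics

def avg_at_k(is_correct_samples: list[list[bool]]) -> float:
    """Avg@k: mean correctness over k stochastic samples, averaged over questions.

    is_correct_samples[i][j] is True if sample j for question i was graded correct.
    """
    per_question = [statistics.mean(samples) for samples in is_correct_samples]
    return statistics.mean(per_question)

# Toy example: 3 questions, k = 4 samples each.
runs = [
    [True, True, False, True],     # 0.75
    [False, False, False, False],  # 0.00
    [True, True, True, True],      # 1.00
]
print(f"Avg@4 = {avg_at_k(runs):.3f}")  # 0.583
```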
On real‑project code repair, the team evaluates on the curated, human‑validated “Verified” subset of SWE‑bench, linked via the SWE‑bench paper; for agent harnesses, they note use of OpenHands, whose repository is at the OpenHands GitHub page, as well as Terminus for uniform Terminal‑Bench execution via the Terminus page. These choices are nontrivial: OpenHands’ iteration limits, context‑window truncation, and temperature settings can shift pass rates by several points, so the explicit configuration helps third parties re‑run the experiments.
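Because harness knobs like these move scores by points, a reproduction attempt typically pins them explicitly. The dataclass below is a hypothetical checklist of what to record; the field names are illustrative rather than OpenHands’ actual configuration schema, and the values are placeholders, not the paper’s settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentHarnessConfig:
    """Hypothetical record of the harness settings worth pinning when re-running
    SWE-bench Verified with an agent framework; names and values are illustrative."""
    harness: str                 # e.g. "OpenHands" at an exact, pinned version
    harness_version: str
    max_iterations: int          # per-issue iteration cap
    context_window_tokens: int   # truncation limit applied to the transcript
    temperature: float
    retries_per_task: int
    sandbox_image: str           # container image the repository runs inside

pinned = AgentHarnessConfig(
    harness="OpenHands",
    harness_version="<exact version from the paper>",  # placeholder
    max_iterations=0,            # placeholder: copy the cap the authors disclose
    context_window_tokens=0,     # placeholder
    temperature=0.0,             # placeholder
    retries_per_task=0,          # placeholder
    sandbox_image="<pinned image>",
)
print(pinned)
```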
Architecturally, the paper positions GLM‑4.5 as a 355B‑parameter sparse MoE with 32B active parameters per token, alongside a 106B‑parameter “Air” variant with 12B active parameters, paired with a long‑context training curriculum that expands from 4K to 128K tokens. The MoE design uses top‑8 routing over 160 experts plus a single shared expert, and it layers an additional MoE block to implement the multi‑token prediction used for self‑speculative decoding. The stack combines Grouped‑Query Attention, introduced at the GQA arXiv page, with rotary embeddings described at the RoPE arXiv page, and stabilizes attention logits with QK‑Norm (see the QK‑Norm arXiv page). The authors also report a heavy increase in attention head count relative to width, a design that did not reduce training loss but improved reasoning benchmarks in their ablations, which comports with recent observations that scaling head diversity sometimes helps task‑level generalization even when perplexity is flat.
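For readers unfamiliar with the routing pattern described above, the following is a minimal sketch of a top‑k MoE layer with one always‑on shared expert, in the spirit of the reported top‑8‑of‑160 design. It is an illustrative PyTorch implementation under assumed shapes and sizes, not the paper’s code, and it omits load‑balancing losses and expert parallelism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer: a linear router picks top-k experts per token,
    their outputs are mixed with renormalized gate weights, and one shared
    expert is applied to every token unconditionally."""

    def __init__(self, d_model: int = 64, d_ff: int = 128, n_experts: int = 16, top_k: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        gate_logits = self.router(x)                              # (n_tokens, n_experts)
        top_vals, top_idx = gate_logits.topk(self.top_k, dim=-1)  # routed expert ids per token
        gate = F.softmax(top_vals, dim=-1)                        # renormalize over selected experts
        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in top_idx[:, slot].unique():
                mask = top_idx[:, slot] == e                      # tokens routed to expert e in this slot
                routed[mask] += gate[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return self.shared_expert(x) + routed                     # shared expert sees every token

# Tiny demo: 4 tokens through a 16-expert, top-4 layer (the paper reports 160 experts, top-8).
tokens = torch.randn(4, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([4, 64])
```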
Training‑wise, the paper documents multi‑stage data curation totaling 23 trillion tokens, with mid‑training phases that up‑sample code, math, and science, extend context to 128K, and inject synthetic agent trajectories before post‑training. Optimization departs from AdamW in favor of the Muon family, with hyperparameters that include Newton–Schulz steps and momentum tuned for large batches; for background, Keller Jordan’s write‑up of Muon is a useful reference at the Muon blog post. These choices are consistent with a broader 2024–2025 trend toward orthogonalized updates and attention‑stabilizing norms to keep very‑deep, very‑wide stacks trainable at high batch sizes.
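As background on what the optimizer does, Muon orthogonalizes each momentum‑accumulated update matrix with a few Newton–Schulz iterations rather than an SVD. The sketch below uses the classic cubic Newton–Schulz iteration to approximate the nearest semi‑orthogonal matrix; the step count and scaling are illustrative, and Muon’s published implementation uses a tuned quintic polynomial rather than this cubic form.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 12) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix (U V^T for G = U S V^T)
    with the classic cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X.

    The input is scaled by its Frobenius norm so the iteration converges.
    """
    x = g / (g.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:                       # work in the short-and-wide orientation
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x

g = torch.randn(32, 64)                  # stand-in for a momentum-accumulated weight update
o = newton_schulz_orthogonalize(g)
print((o @ o.T - torch.eye(32)).abs().max())  # small: rows of o are nearly orthonormal
```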
Post‑training is a two‑stage “expert model iteration.” In Stage 1, three specialized experts are produced for reasoning, agent, and general chat—each cold‑started with supervised fine‑tuning and then reinforced; in Stage 2, a unified hybrid‑reasoning model is distilled from the experts so that it can switch between explicit thinking traces and direct answers. The paper also proposes an XML‑tagged function‑calling template to reduce escape‑character burden for code‑heavy tool invocations, a pragmatic tweak that, while minor, addresses a common pain point in function‑calling agents where JSON‑encoded code blocks become unwieldy.
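To see why an XML‑tagged template eases the escape burden, compare a JSON‑encoded tool call, where a code argument must be folded into one escaped string literal, with an XML‑tagged form that carries the code verbatim between tags. The tag names below are illustrative assumptions, not the paper’s actual template.

```python
import json

code_snippet = 'print("hello")\nif x > 0:\n    y = "a\\"b"'

# JSON-style call: the code must be escaped into one string literal,
# so the model has to emit \" and \n sequences correctly.
json_call = json.dumps({"name": "run_python", "arguments": {"code": code_snippet}})
print(json_call)

# XML-tagged form (illustrative tag names, not the paper's exact template):
# the code travels verbatim between tags, with no escape sequences to get right.
xml_call = (
    "<tool_call><name>run_python</name>\n"
    '<arg name="code">\n'
    f"{code_snippet}\n"
    "</arg></tool_call>"
)
print(xml_call)
```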
For inference speed, the system adds a multi‑token prediction (MTP) layer for self‑speculative decoding and cites compatibility with feature‑level speculation like EAGLE; EAGLE is formalized at the EAGLE arXiv page. The practical implication is that the model can produce and verify short runs of tokens per step without a separate draft model, improving tokens‑per‑second under acceptance‑rate constraints that depend on prompt structure and temperature.
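The verification half of that scheme can be summarized as a greedy draft‑and‑verify loop: the MTP head proposes a short run of tokens, the full model scores them in one forward pass, and the longest agreeing prefix is kept. The sketch below shows only that greedy acceptance logic under assumed draft/verify interfaces; production systems use a probabilistic acceptance rule that preserves the target distribution, and bonus‑token handling is omitted.

```python
def verify_draft(draft_tokens: list[int], target_argmax: list[int]) -> list[int]:
    """Greedy verification: keep the longest prefix of the draft the target model agrees with.

    draft_tokens:  tokens proposed by the cheap draft head (e.g. an MTP layer).
    target_argmax: the target model's greedy choice at each drafted position, obtained
                   from a single batched forward pass; position i is the token the
                   target would emit after accepting draft_tokens[:i].
    """
    accepted = []
    for drafted, preferred in zip(draft_tokens, target_argmax):
        if drafted == preferred:
            accepted.append(drafted)      # agreement: the drafted token is kept "for free"
        else:
            accepted.append(preferred)    # first disagreement: take the target's token and stop
            break
    return accepted

print(verify_draft([11, 42, 7, 99], [11, 42, 8, 99]))  # -> [11, 42, 8]
```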
The headline numbers cluster into three domains. On agentic tasks, the paper reports 70.1% on τ‑bench aggregated across airline and retail, 77.8% on BFCL v3, and 26.4% on BrowseComp—behind OpenAI’s o3 but ahead of Claude Opus 4 and near o4‑mini‑high. On reasoning, it presents 91.0% on AIME‑24 with 32‑sample averaging, 79.1% on GPQA with 8‑sample averaging, and 14.4% on HLE’s text‑only subset, consistent with a general pattern where frontier‑grade, expert‑curated questions remain difficult for all models. On coding, it claims 64.2% on SWE‑bench Verified and 37.5% on Terminal‑Bench with the specified harness and caps. These values, methods, and caveats are all spelled out in the paper and in the launch blog, and they track with contemporaneous reports where proprietary models place higher on BrowseComp and HLE while open models close gaps on BFCL‑like tool calling.
The benchmark choices themselves deserve scrutiny. τ‑bench originated as a controlled Tool‑Agent‑User simulator with strict policies and a pass^k reliability metric, and the original description at the arXiv page emphasizes how even top function‑calling agents struggled to clear 50% in 2024. BFCL v3 explicitly shifted from single‑turn to multi‑turn, multi‑step tools, raising the importance of consistent state tracking. BrowseComp was designed to test browsing agents’ ability to find entangled facts on the open web with short, checkable answers; that makes it particularly sensitive to tool use, allowed extensions, and general browsing competence rather than raw language modeling. Each of these design histories helps explain where a model like GLM‑4.5 might excel or underperform.
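For readers who want pass^k in concrete terms, the τ‑bench estimator is the hypergeometric analogue of pass@k: the probability that k trials drawn without replacement from the n observed trials of a task are all successful, averaged over tasks. The helper below sketches the per‑task estimator; the naming is mine, not the benchmark’s code.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Per-task pass^k estimate: probability that k trials sampled without
    replacement from n observed trials (c of them successful) all succeed."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: a task solved in 6 of 8 trials looks strong at k=1 but much weaker at k=4.
for k in (1, 2, 4):
    print(k, round(pass_hat_k(n=8, c=6, k=k), 3))
# 1 0.75
# 2 0.536
# 4 0.214
```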
On the coding side, LiveCodeBench was constructed to reduce training‑set leakage by continuously collecting new contest problems and to broaden beyond pure code generation into test‑output prediction and self‑repair, while Terminal‑Bench is a live agentic environment whose scores depend on an agent’s planning, file‑system operations, and resilience to transient errors. If two systems report identical raw percentages but use different retry horizons or different sandbox adapters, their effective capabilities are not identical, which is why the GLM‑4.5 team’s disclosure of OpenHands version, iteration caps, and temperature is a material detail rather than a footnote.
The paper’s architecture section gives enough granularity to be technically credible: explicit MoE counts and routing, a comparison table against DeepSeek‑V3 and Kimi K2, and a rationale for “deeper‑not‑wider” trade‑offs that increase layers more than width. It also documents a long‑context curriculum, partial RoPE usage, and the addition of QK‑Norm, which, as the original formulation at the QK‑Norm arXiv page argues, reduces arbitrary softmax saturation and stabilizes gradients. Combined with increased attention head diversity, these choices plausibly support the reported improvements in reasoning benchmarks without necessarily reducing pre‑training loss.
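To make the QK‑Norm mechanism concrete: queries and keys are L2‑normalized along the head dimension before the dot product and rescaled by a learned factor, so attention logits are bounded and the softmax cannot saturate arbitrarily. The sketch below illustrates that idea with assumed shapes and an illustrative scale initialization; it follows the cited formulation rather than GLM‑4.5’s exact implementation.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention_logits(q: torch.Tensor, k: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Attention logits with QK-Norm: cosine similarity between queries and keys
    times a learned scale, instead of raw dot products divided by sqrt(d_head).

    q, k: (batch, heads, seq, d_head); scale: learnable scalar (or per-head) temperature.
    """
    q = F.normalize(q, dim=-1)                 # unit-norm queries along the head dimension
    k = F.normalize(k, dim=-1)                 # unit-norm keys
    return scale * (q @ k.transpose(-2, -1))   # logits bounded by |scale|, so softmax cannot blow up

q = torch.randn(1, 2, 5, 16)
k = torch.randn(1, 2, 5, 16)
scale = torch.tensor(10.0)                     # illustrative initialization; learned in practice
logits = qk_norm_attention_logits(q, k, scale)
print(logits.shape, float(logits.abs().max()) <= 10.0 + 1e-4)
```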
Data and optimization details are likewise unusually specific by contemporary standards. The authors describe web‑scale crawls bucketed by a Nemotron‑CC‑style quality model, SemDedup‑style near‑duplicate removal, and code selection that includes repository‑level concatenation plus Fill‑In‑The‑Middle objectives, followed by mid‑training that up‑samples code, math, and science and extends sequence length. They also publish concrete Muon hyperparameters and a cosine decay schedule, along with batch‑size warmups to 64M tokens. These line items do not by themselves guarantee reproducibility, but they narrow the space of unknowns relative to papers that treat data and optimizer as black boxes.
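Of those line items, Fill‑In‑The‑Middle is the easiest to illustrate: split a document into prefix, middle, and suffix, then rearrange them with sentinel tokens so the model learns to predict the middle from both sides. The sentinel names below follow a common public convention and are assumptions, not GLM‑4.5’s actual special tokens.

```python
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Rearrange a document into prefix/suffix/middle order with sentinel tokens
    (the common 'PSM' layout); training then asks the model to generate the text
    after FIM_MIDDLE, i.e. the missing middle, conditioned on both sides."""
    a, b = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
print(to_fim_example("def add(x, y):\n    return x + y\n", rng))
```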
For reproduction and verification, the team released a lightweight evaluator at the GLM‑SIMPLE‑EVALS repository with task adapters for AIME, GPQA, HLE, LiveCodeBench, SciCode, and MMLU‑Pro, plus checker‑model guidance. That step matters because it encodes normalization around answer extraction, maximum new tokens, and the choice of grader model, all of which can swing headline numbers if left implicit. On math reproducibility specifically, the AIME score is sensitive to the exact question split used, and public variants exist: one example is the curated compilation at the AIME 1983–2024 dataset, and the OpenCompass AIME discussion notes differences between independent collections. Taken together, these references reinforce that small variations in question wording, formatting, or grading scripts can produce proportionally large swings at the top of the leaderboard.
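The kind of normalization such an evaluator pins down is mundane but consequential. As a hedged illustration rather than the repository’s actual logic, the function below extracts a final integer answer from free‑form output; small changes to the regexes or their fallback order can flip graded outcomes on AIME‑style questions.

```python
import re

def extract_final_integer(response: str) -> int | None:
    """Pull a final integer answer out of free-form model output.

    Preference order (illustrative, not GLM-SIMPLE-EVALS' actual rules):
    1. a LaTeX \\boxed{...} value, 2. a trailing 'answer is N' phrase, 3. the last integer seen.
    """
    boxed = re.findall(r"\\boxed\{\s*(-?\d+)\s*\}", response)
    if boxed:
        return int(boxed[-1])
    phrased = re.findall(r"answer\s*(?:is|:)?\s*(-?\d+)", response, flags=re.IGNORECASE)
    if phrased:
        return int(phrased[-1])
    numbers = re.findall(r"-?\d+", response)
    return int(numbers[-1]) if numbers else None

print(extract_final_integer("Thus the answer is \\boxed{204}."))   # 204
print(extract_final_integer("... so the final count is 17."))      # 17
```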
There are, however, constraints and unknowns that the paper cannot fully eliminate. First, the hybrid‑reasoning switch between “thinking” and “direct response” is a training‑data and alignment outcome more than a hard architectural guarantee; evaluating when the model decides to think versus answer quickly would require controlled ablations not present in the public materials. Second, the choice to use a simulator in τ‑bench with an optimized user prompt is reasonable but introduces a distributional shift relative to the benchmark’s default user simulator, which can inflate or deflate relative performance depending on how other teams configure their agents. Third, BrowseComp’s design, by construction, favors models with robust browsing tools and discretion over web snippets; models with weaker browsing stacks will lean on reasoning alone and underperform even if their base language capabilities are strong. These caveats do not negate the reported results, but they bound their generality.
A brief note on organizational context helps explain the push toward open weights. Zhipu’s public materials at the Zhipu AI home page and the product‑facing Z.ai site emphasize enterprise adoption and an API ecosystem alongside open checkpoints. The dual presence—API for production workloads and permissively licensed weights for research and on‑premises deployment—aligns with a growing pattern among East‑Asian labs to retain cloud services while seeding community adoption via open repositories. That makes the earlier licensing mismatch more than a footnote: downstream integrators need to know whether derivatives must attribute or open‑source code changes and whether commercial redistribution is permitted.
Across the technical stack, the paper reads as a systems‑level attempt to harmonize three historically divergent goals. For agentic robustness, it leans on function‑calling templates, mid‑training on trajectories, and tool‑friendly prompting; for reasoning, it doubles down on long contexts, deep stacks, and stable attention; for coding, it prioritizes repository‑level exposure and the ability to act within constrained environments via agent harnesses. The upshot is not a single superlative but a balanced profile: competitive but not dominant in browsing; strong but variance‑aware in math and science; and solidly on the Pareto frontier for code repair versus parameter count. The open question is how these trade‑offs evolve as the community converges on common agent frameworks that reduce harness variance and as future versions improve browsing and long‑horizon planning without over‑reliance on chain‑of‑thought scaffolding.
If you came to the paper to decide whether GLM‑4.5 merits attention beyond press releases, the answer is yes, with careful reading of the footnotes. The model family is documented at a level that enables rough reproduction, the benchmark coverage is broader than most contemporaries, and the disclosed harness configurations are a step toward apples‑to‑apples comparisons. Where numbers are close—particularly on AIME and GPQA—variance controls and dataset provenance checks are warranted, and where browsing is concerned, comparing like‑for‑like tool stacks matters more than raw percentages. The most pragmatic way to interpret the work is as a strong open baseline for agentic, reasoning, and coding workloads that narrows some gaps with proprietary systems while leaving room for improvement in open‑web search.
References
- arXiv paper: GLM‑4.5 (abs)
- Z.ai blog: GLM‑4.5
- GitHub: GLM‑4.5 repository
- Hugging Face: GLM‑4.5 model card
- τ‑bench (arXiv)
- Berkeley Function‑Calling Leaderboard v3
- BrowseComp benchmark
- OpenAI: o3 and o4‑mini announcement
- LiveCodeBench site
- Terminal‑Bench site
- MMLU‑Pro (arXiv)
- SciCode benchmark site
- GPQA dataset card
- Humanity’s Last Exam (CAIS)
- SWE‑bench Goes Live! (paper)
- OpenHands (GitHub)
- Terminus (Terminal‑Bench agent)
- Grouped‑Query Attention (arXiv)
- RoPE (arXiv)
- Query‑Key Normalization (arXiv)
- Muon optimizer (blog)
- EAGLE speculative decoding (arXiv)
- GLM‑SIMPLE‑EVALS (GitHub)
- AIME 1983–2024 dataset
- OpenCompass AIME discussion
- Zhipu AI company site