News

Latest Articles + Event News

May 25, 2026

Written by Graham Toppin Co-chair, TMLS, Co-founder and Analyst at Peerlabs.ai.

Part 2: The Response

Welcome back! In Part 1 of this essay, we covered the technology and financial layers: model capabilities are commoditizing, and the financial structure supporting frontier labs is under stress. That stress comes from compressed capex timelines, rising energy costs, and a cost inversion where compute exceeds the humans it supplements. This creates perverse incentives at the control layer.

The pattern is Technology→Finance→Control. If you haven’t read Part 1, start there. If you have, here’s where the pressure manifests.

The control layer: how to build a moat where none exists

When the technology itself is commoditizing and the financials are under pressure, companies will try to build moat from somewhere else.

This pattern became exceedingly clear in the last couple of weeks, and is the reason our logprobs and fine-tuning essays have been delayed.

Consider:

Fine-tuning is effectively dead on proprietary frontier models. OpenAI announced on May 7 the wind-down of its self-serve fine-tuning API, with the training pipeline closing January 6, 2027 (OpenAI Developer Forum, May 8;Discussion thread). OpenAI was the last to offer it. Anthropic has never provided fine-tuning through its own API; the only fine-tunable Claude model is Claude 3 Haiku (two generations old), available exclusively through Amazon Bedrock in a single AWS region. With the exception of Google Vertex, no current-generation proprietary model from any major lab supports self-serve fine-tuning.
Logprobs have been removed from frontier reasoning models. OpenAI stripped logprobs from all reasoning models (o1, o3, GPT-5, GPT-5-mini). The Responses API omits them entirely. Practitioners migrating from GPT-4o to GPT-5-mini have had to remove logprobs from their payloads. This is a breaking change in production confidence pipelines. Anthropic has never exposed logprobs. Google is the exception, expanding logprob support on Gemini/Vertex AI, perhaps as a counter-positioning strategy.
Harness access is being restricted. Anthropic cut third-party harness access from Claude Pro and Max subscriptions on April 4 2026, and provided clarity on their position on May 13. OpenClaw, the fastest-growing open-source project in history (375 GitHub stars), was named explicitly. Users routing their Claude subscription through any third-party agent framework must now pay separately via API rates. The backlash has been quick, and severe. Time will tell how deep this goes.
The direction is consistent across labs: “Use our model as-is, through our harness, with our defaults.” Two of the primary levers practitioners used to inspect and customize model behaviour (logprobs for confidence signals and fine-tuning for domain adaptation) are both gone from the proprietary frontier. Harness choice is increasingly being restricted. One of the motivations is the frontier labs are hoping each removal will create switching costs and deepen platform dependency.

We would argue these are not arbitrary product decisions. They are rational responses to the financial and competitive pressures we’ve been discussing. This is not to dismiss other competitive pressures the labs are facing. Distillation from competitors, harness use breaking pricing models where some users subsidize others, are also at play here. But in our opinion this is further evidence of how difficult the economics of the frontier labs are.

If model capability is commoditizing, the technology itself is not a durable moat.

Platform lock-in making it costly and difficult for practitioners to leave is part of the frontier labs’ strategy, though at this point it isn’t obvious this is a workable strategy.

The hope is workflows built around proprietary harness features don’t easily transfer. But it isn’t obvious any of the attempted tactics will work, given the strong open source tooling in place, and the incentive to build resilient structures and practices given the unrealistic costs of Frontier tooling.

The open-weight counter-narrative

The “control lever” removal creates a structural advantage for open-weight models that benchmarks don’t capture.

Actions by the Frontier Labs mean the question is no longer only “can the open model match the closed model on MMLU?”

Instead it is: “can I do things with the open model that the closed model’s vendor won’t let me do at all?”

As of May 2026:

Fine-tuning: Available on all major open-weight models (Gemma 4, Qwen 3.6, DeepSeek V4, Llama), with mature tooling (Unsloth, Axolotl, TRL, Ludwig) is not available on any current-generation proprietary frontier model other than Google Gemini (though the effectiveness of their APIs is disputed).
Logprobs: Fully available on open-weight models via vLLM and other serving frameworks. Not available on proprietary reasoning models (except Gemini).
Harness choice: Open-weight models work with any harness — OpenClaw, opencode, pi, hermes, aider, Cursor, or custom orchestration. No vendor restrictions on how you invoke the model.
Cost: DeepSeek V4-Flash at $0.14/$0.28 per million tokens. Qwen3 Coder at $0.07/$0.07 on Novita.ai. Gemma 4 26B MoE at $0.06/$0.33 on OpenRouter. Versus $5/$30 for GPT-5.5 or $5/$25 for Opus 4.7 (before the 35% tokenizer inflation). The delta is 40-200x depending on the comparison. (Pricing sourced from provider pages and OpenRouter, verified May 4, 2026.)
Data sovereignty: Nothing leaves your infrastructure if you self-host. For regulated industries, this is a requirement. (Plug: in Peerlabs’ inference infrastructure study, 4 of 5 self-hosted deployments were privacy-motivated)

The open-weight advantage is no longer primarily about capability or even cost. It is about control: the ability to inspect, customize, and deploy models on your own terms, without the risk that a vendor product decision breaks your production pipeline.

What we can say with varying degrees of confidence

So, what does all of this mean? We’ll try to break it down without being sensational or pejorative.

Things we can say with more certainty:

Model capabilities at the frontier are plateauing. The improvements are real but diminishing relative to the investment required to achieve them.
The gap between “frontier” and “good enough” is narrowing. Open-weight models are crossing meaningful capability thresholds with increasing frequency.
Commoditization of the model layer is the most probable medium-term outcome. This will most likely mean better prices for practitioners but margin compression for providers.
Frontier labs will continue to respond to commoditization pressure by building non-technical moats: platform dependency, restricted access, removed control levers.
If the cost of AI-assisted work continues to exceed or come close to a clear multiple the cost of the human work it supplements, or promises to replace, the economics don’t make sense. They may improve, but they don’t appear favourable today at scale.

Things with less certainty, but worth tracking:

The financial structure underlying the frontier lab buildout appears more fragile than it was 12 months ago. Oracle’s debt, the energy price shock, free cash flow compression across big tech, and the Anthropic IPO speculation all point in the same direction. An industry correction more severe than normal market adjustment is not the base case, but it is more plausible now than at any comparable point in the last 18 months.
Whether Google’s counter-positioning (model-agnostic platform, logprob support, TPU silicon optimization) successfully absorbs multi-cloud workloads is also unclear, as is whether their Agent Platform is too early for the market.

Things we consider extremely unlikely:

Transformative ASI/AGI on near-term timelines (in the next 5 years) seems like a distant possibility. The METR time-horizon data is useful here (METR, updated May 8). The differences between the 50% and 80% curves in their latest study are revealing. The 50% curve frequently cited in headlines means the agent fails half the time on tasks of that difficulty. The 80% curve, closer to, but still not at, operational reliability, is substantially lower. METR itself notes that “measurements above 16 hours are unreliable with our current task suite,” and the task distribution is trimodal (the model either nails it or completely fails, with less middle ground than the logistic fit implies). The capability gains are real but they are gains in task completion on well-specified, low-context software problems, not general intelligence. The vendor (i.e. Frontier Labs) narrative requires the AGI framing to justify the CapEx. The practitioner evidence does not support it.
Dario Amodei’s claim that “coding is going away first, then all of software engineering” (April 25, 2026) has not aged well. Grady Booch attributed this to IPO motivation (X post, 158K views). Gergely Orosz tested it empirically and found AI-assisted coding amplifies existing expertise and fails where expertise is absent (X post, 743K views). The Faye “supervision paradox” (May 3, 391 HN points) adds the skill-atrophy dimension: developers need strong coding skills to supervise agents, but using agents causes those skills to decay. The research on Chain of Thought faithfulness further weakens the claim: if a meaningful fraction of visible reasoning is post-hoc rationalization rather than causal computation (arXiv:2604.15726; arXiv:2603.22582), the “AI understands software engineering” narrative is less well-supported than the fluency of the output suggests.

Thanks for reading this far! This is a complex picture, and finding the most likely scenario, ignoring both the AI optimists and pessimists is difficult. We hope we’ve managed to do so.

We would urge you to follow the four recommendations we’ve made:

Use OpenAI-compatible endpoints
Build confidence pipelines on signals you control
Treat the harness as a layer you own
Measure on your workload, not on benchmarks

If you need any help or advice, weigh-in in the comments or reach out directly. Our community are willing and able to discuss and to help!

This note reflects our analysis as of May 2026. All claims are sourced from public reporting and research referenced in our TMLS Steering Committee notes. The scenarios described are assessments of relative probability, not predictions. We expect to be wrong about some of them and will update accordingly.