News

Latest Articles + Event News

May 22, 2026

Written by Graham Toppin Co-chair, TMLS, Co-founder and Analyst at Peerlabs.ai.

Part 1: The Pattern

Apologies for not getting an AI in production out this past Monday. We had put together an extensive essay on logprobs and model / harness customization. A number of things happened to change our stance toward the end of the week.

Today, we’re going to discuss:

Current trends in Generative AI and ML in general
Possible implications to how we’re building systems
What you can do today to plan for these events

While we’re going to speculate a bit, we’ll try to ground this in provenance you can reason about.

There’s a lot here, so this will be a two part essay.

We’re going to publish the articles on customization later this week, focused on open-weight and local models given the changes we’re seeing in Frontier models.

Let’s dive into it!

BLUF (Bottom Line Up Front) What you should do

To sum up:

Generative AI is useful.
The technology is improving, though at a reduced rate.
The technology is also commoditizing, the financial structure supporting it is under stress, and the vendors are responding by restricting practitioner control.
These things are all true simultaneously, and they all point in the same direction: build your AI practice on foundations you control.

If you read nothing else, read this section.

Optionality is the strategy

Frontier models’ value propositions are shifting from “irreplaceable capability” to “convenient, well-integrated, professionally supported.”

This is for a few reasons:

the technology is commoditizing: the product gets cheaper and better, the margins compress, and Frontier labs are less concerned about technology access and more about ecosystem lock-in.
For practitioners, the structural implication is straightforward: optionality is your strategy.
If you are building deep dependencies on a single frontier provider’s platform (e.g. fine-tuned models that can’t be ported, logprob-dependent confidence pipelines that break on migration, harness-specific workflows that don’t transfer) you are accumulating switching costs.
Those switching costs may not be justified by the capability premium (i.e. how much better the Frontier Labs’ models are) between (proprietary) Frontier models and other models (e.g. open-weight and open-source models), especially as that premium narrows.
If you can, maintain or create the ability to move between providers, or between proprietary and open-weight, so you are better positioned regardless of which scenario plays out.
If commoditization continues (we consider this the most likely), you can follow the cost curve down to open-weight models without rebuilding your infrastructure.
Similarly, if a frontier lab achieves a genuine breakthrough (we consider this very unlikely), you can adopt it without being locked out by switching costs.
If the financial structure corrects (less likely but plausible – see our discussion below), you are not dependent on a single provider whose business model may be under stress.

More concretely, this means:

Use OpenAI-compatible API endpoints wherever possible. Novita.ai, OpenRouter, Together, and most open-weight serving frameworks support the same API schema. Switching models means changing one environment variable, not rewriting your integration. Note, Anthropic API endpoints are also viable; the ACP / UCP / MCP standards are still in flux, with no clear winner as of this writing.
Build confidence pipelines on signals you control. If you need logprobs, use open-weight models via vLLM or Ollama. If you need fine-tuning, use open-weight models with mature tooling. Don’t build critical infrastructure on capabilities a vendor can remove with a product update. We will explicitly discuss how you can do this in future articles.
Treat the harness as a layer you own. What we hear from practitioners is consistent and unambiguous: what distinguishes productive from unproductive use of agents is the workflow architecture around the harness, not the harness itself. Context externalization, structured checkpointing, scoped sessions, and observability are all practices that transfer across harnesses and providers. A good rule of thumb to continuously check is how easily you can move a workflow from one provider to another. Being able to do this is also essential to our next point – evals and benchmarks.
Measure on your workload, not on benchmarks. Benchmarks are definitionally imperfect. They can be easily compromised by contamination, saturation, and selective reporting. Our recommendation is to build substantial and substantive domain-specific test cases covering your actual failure modes. This is the evaluation practice that consistently works. Treat all evals as triage not as ground truth. Develop infrastructure to increase your understanding and control of systems over time.

If you’re interested in how we came to these conclusions, read on!

Programming note: TMLS Conference lineup is live. Ion Stoica (Berkeley; creator of Ray, vLLM, Spark) keynoting in Toronto, plus 60+ practitioners shipping the exact systems this essay describes.

View Agenda

What we’re seeing

The TMLS Steering Committee tracks and actively discusses trends in AI and Machine learning. In the last few weeks, an unanticipated structural pattern has emerged, surfaced through the accumulation of evidence over eight weeks of tracking.

This pattern has three layers:

Technology
Finance
Control

The order is important and the connection between them matters (Technology–>Finance–>Control). The interaction between these three explains a lot of what we are experiencing right now. We think you may be seeing the same thing.

At the technology layer, Frontier Model capabilities are plateauing and the gap between proprietary and open-weight models is narrowing. This is commoditization, and is the same trajectory every maturing technology follows.

Commoditization creates pressure at the financial layer. Frontier Labs and their investors have committed hundreds of billions in capex on the assumption of sustained pricing power and exponential growth. If the technology is commoditizing, the margins to service those commitments compress; and the quantum of spend compressed into this narrow a window makes the payback math unforgiving.

This financial pressure drives behaviour at the control layer. When the technology itself is no longer a durable moat, the rational response is to build non-technical moats: to restrict access, remove practitioner control levers, and create switching costs. This is what we are observing across the major Frontier Labs right now.

The practitioner implication flows from the same chain in reverse: if control is being restricted, if the financial structure is fragile, and if the technology is commoditizing, then building deep dependencies on a single frontier provider carries increasing risk. When we discuss optionality, we are referring to the ability to move between providers and between proprietary and open-weight.

Given all of the above, we believe optionality is the appropriate response.

In this post (Part 1), we’ll cover the technology and financial layers. In Part 2, we’ll cover the control layer, the open-weight counter-narrative, and our confidence assessments.

Terminology note: throughout this document, we’ll be referring to the “open-closed gap” to refer to the gap between closed (frontier) models and open models.

The technology layer: capability is a commodity

Frontier model capabilities have been showing signs of plateauing, arguably for about 12-18 months. The improvements are real but incremental. The gap between “frontier” and “good enough” is narrowing faster than the frontier is advancing.

Widely cited is Claude Opus 4.5, representing a “watershed” moment for the usability of Generative AI in coding. This is correct, however when you look more closely at benchmarks (and yes, benchmarks are flawed) the picture becomes clearer: We have been approaching asymptotic improvement for a while, but Opus 4.5 represented a threshold being crossed, not a fundamental change in the trajectory of improvement.

Consider:

DeepSeek V4-Pro (MIT license, 1.6T total / 49B active) benchmarks between GPT-5.2 and GPT-5.4 on reasoning tasks at roughly one-seventh the price (Simon Willison; MIT Technology Review).
Qwen 3.6’s 9B parameter model outperformed GPT-OSS-120B on GPQA Diamond using a hybrid architecture. It can run on a phone (GitHub).
Gemma 4’s 31B dense ranks #3 on Arena AI. Apache 2.0 licensed. It can run quantized on consumer GPUs (Google AI Blog, April 2; HuggingFace).
Nathan Lambert’s sustained analysis at Interconnects (Feb-April 2026) hypothesizes the open-closed gap is roughly 6 months. An important nuance that open models keep crossing meaningful capability thresholds even if the absolute frontier stays ahead.
Apple’s decision not to invest in building a competitive foundation model from scratch may turn out, in retrospect, to be one of the wisest strategic moves in the GenAI age. Apple has chosen to buy Gemini’s capabilities and wrap them in Apple’s own integration layer (The Information, March 25; Bloomberg, January 12). This is telling because Apple has essentially an unlimited R&D budget, yet has chosen to be more circumspect in investing in a frontier model.
Benchmark “gaming” further complicates our understanding of model performance and capability: contamination, saturation, and selective reporting mean the benchmarks used to measure the gap are themselves unreliable. Andrej Karpathy has said: “Training on the test set is a new art form.” The GSM1K replication showed up to 13 percentage points of overfitting in some model families (Scale AI / GSM1K; Benchmark Data Contamination of Large Language Models: A Survey). If the gap is measured with compromised instruments, the real gap may be larger or smaller than reported, and more likely smaller given the rewards involved.

So, what does all of this mean? The most probable short-to-medium term outcome is commoditization of the model layer.

We need to be clear: this is not meant to be a prediction of collapse or irrelevance. It is more than likely the normal trajectory of a maturing technology. Commoditization will mean better prices for consumers and practitioners, but a margin-constrained (or margin-compressed) business for providers.

And margin-constrained business models matter because of what it does to the financial layer.

The financial layer: the spending is real, the revenue is not (yet)

Many AI skeptics and apologists are debating the merits of the technology. We would argue the more meaningful challenge to Generative AI is making the economics make sense.

What makes our current moment unusual is the large amount of upfront and planned investment in Generative AI and the implications in the likely scenario of it not reaching expectations.

Microsoft, Meta, Alphabet, Amazon, and Oracle collectively plan to spend $630-700B+ in AI infrastructure in 2026, an eye-watering figure rivalling Sweden’s GDP (Tech Insider).

Morgan Stanley estimates ~$2.9 trillion in global data centre construction through 2028, with 80%+ of spending still ahead (Morgan Stanley, April 2026).

Revenue has not kept pace. Deloitte’s State of AI in the Enterprise report finds productivity gains are positive but modest, with implied labour productivity growth of roughly 0.8% in high-skill services (Deloitte, 2026).
Revenue growth from AI “largely remains an aspiration.” 74% of organizations hope to grow revenue through AI, but only 20% appear to be doing so.
BlackRock’s Q2 2026 outlook is explicit: “The AI builders are leveraging up — investment is front-loaded while revenues are back-loaded. Along with highly indebted governments, this creates a more levered financial system vulnerable to shocks — including bond yield spikes.” (BlackRock, April 22).

Why is this important? Front-loaded investment with lagging revenue is normal for infrastructure CapEx. What is not normal is the quantum of spend compressed into this narrow a window. $630-700B in a single year makes the emphasis on cash flow and payback acute, and a credible analysis of payback timelines at current revenue trajectories is noticeably absent from the public discourse, including from the labs themselves.

At the same time, physical infrastructure is under stress from outside the technology sector:

The 2026 Iran conflict has pushed Brent crude past $100/barrel, disrupted 20% of global oil trade and 19% of global LNG trade, and resulted in the first-ever military strikes on data centres (Columbia CGEP, April 16; CNBC, March 11).
Oracle’s debt situation ($124.7B in long-term debt, CDS spreads at all-time highs, banks hitting single-counterparty exposure limits) is the most visible sign of financial fragility in the AI infrastructure build-out; however, the structural pattern extends beyond a single company (Benzinga, April 24; Motley Fool, April 10).

This has created or exacerbated a cost inversion already visible at the operational level. Bryan Catanzaro, NVIDIA’s VP of Applied Deep Learning, told Axios in May 2026: “For my team, the cost of compute is far beyond the costs of the employees.” (Fortune, May 2026) ‘nuff said.

This follows an emerging picture in the data:

Uber engineers using Claude Code have already blown through the company’s entire 2026 AI budget (Uber CTO Shows How Claude Code Can Blow Up AI Budgets).
A Stockholm software engineer told the New York Times: “I probably spend more than my salary on Claude.”
Jensen Huang proposed giving engineers token budgets equal to roughly half their base salary, framed as a recruiting incentive (Axios via Futurism, May 2026).
Meta’s internal “Claudeonomics” leaderboard ranked employees by token consumption, awarding titles like “Token Legend” to the top user (281 billion tokens in 30 days)(The Information, April 6). The leaderboard was removed two days after The Information reported on it (The Information, April 8).
Jellyfish data on 7,548 engineers found that the largest token budgets produced 2x throughput at 10x the token cost. GitClear found AI users averaged 9.4x higher code churn. Faros AI reported code churn increases of 861% under high AI adoption (TechCrunch, April 17; Jellyfish: Is tokenmaxxing cost-effective? New data from Jellyfish explains).
Meta engineers admitted to inflating token usage to avoid being tagged as insufficiently “AI-native” (Pragmatic Engineer, April 2026).

An analysis of SEC filings across 32 companies that publicly linked layoffs to AI between 2023 and Q1 2026 found operating margins declined or held flat at every company that buys AI. The study found only companies selling AI infrastructure have improving margins. Further, only one company out of thirty-two, Salesforce, showed genuine, measurable improvement where margin gains, headcount reductions, and a named AI product all aligned. The authors’ framing: the payroll savings at AI-buying companies are becoming revenue at AI-selling companies (Read Uncut, May 18). Heads up: the framing of the data in this study is more polemic than we would prefer; however the data are strong evidence of the stress the frontier labs are under.

The thesis “cheap compute replaces expensive humans”, is generating a lot of uncertainty, and it is currently running in reverse at the organizations purchasing and building out Generative AI product and operations.

This does not mean the economics will never work. (E.g. inference costs are falling (DeepSeek V4-Flash at $0.14/M input; Qwen3 Coder at $0.07/M on Novita.ai)), and the open-weight cost structure is dramatically cheaper than frontier proprietary.

But it does mean at the moment the cost structure has not caught up to the adoption curve; and the organizations most aggressively adopting AI tools are the ones feeling this margin pressure the most.

More specifically:

Frontier labs and Big Tech companies are depending on an enormous ROI for their GenAI investments
If this turns out not to be the case, the disruption ultimately will not be as significant and lucrative as initially planned
If Anthropic and OpenAI turn out to be 10-40B ARR companies in the next 5-10 years, they will be considered to be failures.

An important dissenting opinion is Anthropic’s recent $30B revenue run rate. We treat this number with care, however, given it reflects growth based on the same pricing model and usage changes we have seen disrupt Anthropic’s user base; however, this is a narrative worth considering when reading this essay.

This means there are perverse incentives at play, which leads to our discussion of the control layer.

In Part 2, we’ll examine how frontier labs are responding to these pressures, why open-weight models are gaining a structural advantage beyond capability, and what we can say with varying degrees of confidence about where this is heading.

This note reflects our analysis as of May 2026. All claims are sourced from public reporting and research referenced in our TMLS Steering Committee notes. The scenarios described are assessments of relative probability, not predictions. We expect to be wrong about some of them and will update accordingly.