Do You Really Need GPT-5? The Densing Law Says Probably Not

By Prahlad Menon 6 min read

There’s a moment in every technology cycle when the conventional wisdom flips. For a decade, the AI story has been simple: bigger is better. More parameters. More data. More compute. The scaling laws promised that if you just kept climbing, intelligence would emerge at the summit.

But what if the summit is a mirage?

A paper published in Nature Machine Intelligence this month introduces a concept that should make every enterprise AI team reconsider their vendor contracts: the densing law. The finding is elegant and disruptive—model capability density doubles approximately every 3.5 months. In plain English: a model with half the parameters can achieve equivalent performance to today’s state-of-the-art in just over a quarter.

The Density Revolution

The research team, led by Chaojun Xiao and colleagues, analyzed 51 open-source pretrained models released since Llama-1 and discovered something remarkable. They introduced “capability density”—essentially, performance per parameter—as a metric for comparing models across scales. When they plotted maximum density against time, the trend was unmistakable: exponential growth.

Consider the numbers. Llama-1, released in February 2023, had a density below 0.1. Recent models like Gemma-2-9B and MiniCPM-3-4B approach a density of 2. That’s a 20x improvement in under two years. The researchers provide a striking example: MiniCPM-1-2.4B, released in early 2024, achieves comparable or superior performance to Mistral-7B from late 2023—with only 35% of the parameters.

Let that sink in. Four months. One-third the size. Same results.

The API Price Collapse

If the densing law describes algorithmic efficiency, the market has already priced in its implications. In December 2022, GPT-3.5 cost $20 per million tokens. By August 2024, Gemini-1.5-Flash delivered superior performance at $0.075 per million tokens—a 266x reduction. The researchers estimate inference costs halve every 2.6 months.

This isn’t incremental progress. It’s the kind of exponential curve that destroys business models built on scarcity assumptions. Every enterprise locked into long-term contracts with frontier model providers should be running the numbers on what their commitment will be worth in 12 months.

Why Bigger Isn’t Always Denser

Here’s where it gets counterintuitive. Llama-3.1-405B is among the most capable open-source models available. But it’s not the densest. The paper explains why: “Constrained by computational resources and the scale of pretraining data, we usually cannot fully optimize the training settings for extremely large models, making them suboptimal in terms of cost-effectiveness.”

Translation: the giants are undertrained relative to their size. The sweet spot for density appears to be in the 2B-9B parameter range, where models can be thoroughly optimized given current compute constraints.

The paper also deflates a common assumption about model compression. Pruning, distillation, and quantization—techniques meant to shrink large models while preserving performance—often decrease density rather than improve it. The compressed versions of Llama-3.1-8B all showed lower density than the original. Only Gemma-2-9B, distilled from its 27B sibling, bucked the trend. Efficient compression, it turns out, is harder than training a dense small model from scratch.

The Healthcare Imperative

This research lands at a pivotal moment for healthcare AI. The sector has been wrestling with a fundamental tension: large language models show remarkable potential for clinical decision support, but their deployment requirements—cloud dependence, latency, privacy concerns, cost—clash with healthcare’s operational realities.

Small language models resolve this tension. A comprehensive survey published on arXiv tracks the rise of SLMs in healthcare, documenting how models in the 1B-7B range are achieving clinical performance that approaches or matches their larger counterparts on tasks from medical NLP to diagnostic support. The key insight: domain-specific fine-tuning on smaller models often outperforms generic large models, particularly when the task is well-defined.

For health systems evaluating AI deployments, the implications are profound. A 3B parameter model running on-premise—private, fast, and cheap—may outperform a 70B parameter API call that introduces latency, compliance risk, and unpredictable costs. The research from institutions like Cleveland Clinic and NHS partners has consistently shown that task-specific small models, properly tuned, deliver production-grade results.

The Convergence: Densing × Moore’s

The paper’s most striking prediction involves combining the densing law with Moore’s law. If capability density doubles every 3.5 months and chip computing power doubles every 2.1 years, the maximum effective model size deployable on consumer hardware doubles every 88 days.

Three months. That’s how long before your smartphone can run a model as capable as what required a data center today.

This isn’t futurism—it’s the logical extension of two well-documented trends. Apple Intelligence, running local models on iPhones, is the vanguard. By 2027, the paper suggests, high-quality LLMs will run efficiently on edge devices as a matter of course.

What This Means for Practitioners

If you’re building AI systems today, the densing law suggests a few strategic shifts:

Rethink vendor lock-in. Long-term commitments to frontier model providers look increasingly risky when equivalent capability will cost 10x less in a year. Consider architectures that allow model swapping.

Invest in fine-tuning infrastructure. The density advantage of smaller models is amplified by task-specific training. Building the capability to fine-tune and deploy custom small models may be more valuable than API credits.

Watch the 2B-9B range. This is where density optimization is most active. Models like Gemma-2-9B, MiniCPM-3, and Phi-3 represent the bleeding edge of efficiency.

Revisit compression assumptions. If you’re planning to distill a large model into a smaller one, the evidence suggests you might achieve better results training a dense small model from scratch.

The Scaling Law’s New Chapter

The scaling laws haven’t been invalidated. Larger models, properly trained, still outperform smaller ones. But the densing law reveals that we’ve been asking the wrong question. The question isn’t “how big can we make it?” but “how much capability can we pack into a given parameter budget?”

The researchers note that the rate of density improvement accelerated 50% after ChatGPT’s release—from a doubling time of around 5 months to 3.5 months. The AI community’s intense focus on efficiency, sparked by the commercial potential ChatGPT revealed, is paying dividends that benefit everyone.

For enterprises, the message is clear: the era of assuming bigger is better is over. The models that will win aren’t the largest—they’re the densest. And density, unlike size, is getting cheaper by the month.


The densing law research was published in Nature Machine Intelligence (Xiao et al., 2025). The healthcare SLM survey is available on arXiv (Garg et al., 2025).