For years now, the dominant strategy for making AI smarter has been brute force. More layers, more parameters, more GPUs, more data centers. OpenAI, Google, xAI, and Meta keep pouring concrete and silicon into the scaling law. It works, to a point. And then it stops working.
DeepSeek just published a paper that attacks this problem from the opposite direction. No new model release, no benchmark wars, no viral app store moment. Just a technical contribution called MHC (Manifold Constrained Hyperconnections) that redesigns how information flows through a neural network. It's the kind of work that won't make the front page of Reddit, but it will quietly shape how frontier models are built for the next several years.
I spent time with the paper and a detailed technical breakdown by Kale Bright, and the implications are worth unpacking for anyone building with AI or making decisions about where this technology is headed.
Why Bigger Stopped Meaning Better
The intuition that larger models should be smarter makes perfect sense on the surface. A 20-layer neural network is fairly capable. A 56-layer version should be more capable. More room for knowledge, more capacity for reasoning.
Except it doesn't work that way. When researchers tested this by simply stacking more layers onto existing architectures, performance actually got worse. The model became bigger and dumber at the same time.
The reason is a training problem called vanishing and exploding gradients. During backpropagation (the process where a model learns from its mistakes), the error signal has to travel backwards through every layer. Each layer slightly distorts that signal. In a 20-layer network, the distortion is manageable. In a 56-layer network, the signal either fades to nothing or amplifies into chaos by the time it reaches the early layers.
Think of it as a game of telephone, but with drawings. You draw a cat, pass it to the next person, who redraws it and passes their version forward. By the twentieth person, you still recognize the whiskers and the ears. By the fiftieth person, you're looking at something that might be a potato. The training signal that tells early layers "hey, adjust yourself" gets so degraded that those layers stop learning anything useful.
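The arithmetic behind the telephone game is simple compound decay. A toy sketch (illustrative numbers, not from any paper): if each layer attenuates the backward signal by a constant factor, depth multiplies that attenuation exponentially.

```python
# Toy model of vanishing gradients: the gradient reaching the first layer
# is roughly a product of per-layer gains. Assume each layer passes on
# 90% of the signal (an illustrative figure, not a measured one).
gain = 0.9

signal_20 = gain ** 20  # roughly 0.12: early layers still get a usable signal
signal_56 = gain ** 56  # roughly 0.003: the signal has effectively vanished

# The same compounding runs the other way: a per-layer gain above 1.0
# explodes instead of vanishing.
exploding = 1.1 ** 56   # hundreds of times the original magnitude
```

The exact numbers don't matter; the shape of the curve does. Anything other than a per-layer gain of almost exactly 1.0 becomes catastrophic at depth, which is why the fix had to be architectural rather than a matter of tuning.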
This was a fundamental barrier. You couldn't just buy more intelligence by making the model bigger. The architecture had to change.
The ResNet Foundation (and Its Quiet Limitation)
In 2015, researchers introduced ResNet (Residual Networks), a simple but powerful idea. Instead of forcing information to flow exclusively through each layer's transformation, they added a bypass channel. The original input gets added directly to the output of each layer. Two streams of information: one that gets processed and transformed, one that passes through cleanly. They merge at the end of each block.
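The structure is almost trivially simple to write down. A minimal sketch (the `transform` here is a stand-in for a real layer's learned computation, not ResNet's actual convolutions):

```python
import math

def transform(x, w):
    # Stand-in for one layer's processing stream: scale, then squash.
    # A real ResNet layer would be convolutions with learned weights.
    return [math.tanh(w * v) for v in x]

def residual_block(x, w):
    # ResNet's idea: add the untouched input back to the layer's output,
    # so a clean copy of x always survives the block.
    return [a + b for a, b in zip(x, transform(x, w))]

x = [1.0, -0.5, 0.25]
out = residual_block(x, 0.1)
# Even if the transform contributes nothing (w = 0), the block still
# passes x through unchanged: residual_block(x, 0.0) == x.
```

That identity path is the whole trick: gradients flowing backward through the addition reach earlier layers undistorted, no matter how badly the transform branch behaves.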
This solved the vanishing gradient problem. The clean bypass channel preserved the training signal, allowing it to reach early layers without being degraded by dozens of intermediate transformations. ResNet won first place in the 2015 ImageNet competition and became the default architectural choice for nearly every neural network that followed, including the transformers that power today's LLMs.
For about ten years, nobody seriously challenged this plumbing design. It worked. Models got bigger, performance scaled, and the residual connection became invisible infrastructure, like the copper wiring in your walls. You don't think about it because it just works.
But "works" and "works optimally" are two different things.
When this residual architecture moved from image classification to language models, a new decision emerged: where to place the normalization layer. The original transformer paper (Google, 2017) placed it after the residual addition (post-layer norm). GPT-2 moved it to the front of each sub-layer, inside the residual branch (pre-layer norm), which made training more stable.
Pre-layer norm became standard because it avoided the gradient instability of post-norm. But it introduced its own problem: representation collapse. Deep layers started producing increasingly similar outputs, as if the extra capacity you added wasn't actually contributing new intelligence. You paid for 80 floors but everyone worked on the same three.
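The two placements differ by one line of code. A simplified sketch, with `sublayer` standing in for attention or the feed-forward block:

```python
import math

def layer_norm(x):
    # Normalize to zero mean and unit variance (epsilon for stability).
    m = sum(x) / len(x)
    var = sum((v - m) ** 2 for v in x) / len(x)
    return [(v - m) / math.sqrt(var + 1e-5) for v in x]

def sublayer(x):
    # Stand-in for the attention or feed-forward transformation.
    return [math.tanh(v) for v in x]

def post_ln_block(x):
    # Original 2017 transformer: normalize AFTER the residual addition.
    # The bypass itself passes through the norm, so the clean path is
    # no longer clean, which destabilizes very deep training.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x):
    # GPT-2 style: normalize the sub-layer's input, keep the bypass clean.
    # Training stabilizes, but deep layers drift toward similar outputs.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

Notice that in the post-LN version, `layer_norm` wraps the residual sum itself: the identity path that ResNet worked so hard to keep untouched now gets reprocessed at every block. Pre-LN restores the clean path at the cost of the collapse problem described above.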
For years, model builders had to pick their poison. Stable training with wasted depth, or full use of depth with unstable training. Neither option was great.
ByteDance Opens the Door, DeepSeek Walks Through
In late 2024, ByteDance (TikTok's parent company) proposed something called Hyperconnections. Instead of a single bypass channel, they split the input into multiple sub-vectors and created parallel streams of information flowing through the network.
The idea was elegant. By partitioning the representation across different routes, you avoid the collapse problem. Different streams carry different aspects of the signal, and the model learns to refine each one independently. ByteDance's ablation studies showed 1.8 times faster convergence and meaningful performance improvements with minimal extra compute.
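Mechanically, hyperconnections replace ResNet's single pass-through with a learned mixing of several parallel streams. A minimal sketch (the matrix values here are illustrative; in the real architecture they are learned, and the layer's transformation is applied alongside the mixing):

```python
def mix_streams(streams, M):
    # Hyperconnections, simplified: each output stream is a weighted
    # combination of ALL input streams, with weights from matrix M.
    n = len(streams)
    return [
        [sum(M[i][j] * streams[j][k] for j in range(n))
         for k in range(len(streams[0]))]
        for i in range(n)
    ]

# Two parallel streams instead of ResNet's one bypass channel.
streams = [[1.0, 2.0],
           [3.0, 4.0]]
M = [[0.7, 0.3],
     [0.3, 0.7]]  # learned in practice; fixed here for illustration

mixed = mix_streams(streams, M)
```

The instability ByteDance hit lives in `M`: nothing in an unconstrained learned matrix stops the total signal magnitude from growing a little at every layer, and a little growth per layer compounds into the 3,000x amplification the article describes.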
One problem: it was unstable at scale. As you stacked layers, the multiple streams amplified unpredictably. By layer 60, the amplification factor hit 3,000 times. Training became chaotic. The concept was sound, but the execution couldn't survive the scale that frontier models require.
This is where DeepSeek's MHC enters.
DeepSeek's contribution is a mathematical constraint that tames the chaos of hyperconnections. They force the mixing matrices (the weights that control how streams interact) onto a geometric structure called a Birkhoff polytope. In practical terms, this means every row in the matrix sums to one and every column sums to one (a doubly stochastic matrix). They achieve this using the Sinkhorn-Knopp algorithm, which iteratively normalizes the matrices to fit this structure.
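The Sinkhorn-Knopp procedure itself is short enough to sketch. This is a generic textbook version, not DeepSeek's implementation: alternately normalize rows and columns, and for a positive matrix the result converges to a doubly stochastic one.

```python
def sinkhorn_knopp(M, iters=50):
    # Alternate row and column normalization. For a matrix with positive
    # entries this converges to a doubly stochastic matrix: every row
    # and every column sums to one.
    n = len(M)
    A = [row[:] for row in M]
    for _ in range(iters):
        for i in range(n):                       # normalize each row
            s = sum(A[i])
            A[i] = [v / s for v in A[i]]
        for j in range(n):                       # normalize each column
            s = sum(A[i][j] for i in range(n))
            for i in range(n):
                A[i][j] /= s
    return A

A = sinkhorn_knopp([[1.0, 2.0],
                    [3.0, 4.0]])
row_sums = [sum(row) for row in A]
col_sums = [sum(A[i][j] for i in range(2)) for j in range(2)]
# Both are now ~[1.0, 1.0]: mixing weights can redistribute signal
# between streams, but can no longer amplify its total magnitude.
```

That conservation property is the point of the constraint. A doubly stochastic mixing matrix can shuffle information between streams at every layer, but the 3,000x blow-up is mathematically off the table.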
The result: you get all the benefits of ByteDance's multi-stream architecture (richer representations, faster convergence, better reasoning) without the training instability that made it impractical at scale. Predictable behavior, higher information density, and stable training. All from redesigning the plumbing.
Why This Matters More Than a New Model
I've been watching the AI race from two angles. As someone who builds with these models daily (Claude Code with Opus 4.6 is my primary tool), and as someone who teaches executives about AI adoption at the University of Naples and ESCP. From both angles, DeepSeek's approach tells a more interesting story than any single model release.
The US AI industry has largely followed one playbook: scale. More parameters, more data, more compute. It's working, but it's expensive and it's hitting diminishing returns. The next 10x improvement in model capability probably won't come from a 10x increase in training compute. The gains are in efficiency, in how well each parameter contributes to intelligence.
DeepSeek and ByteDance are operating under different constraints. Export controls limit their access to cutting-edge chips. Their compute budgets are a fraction of what OpenAI or Google can deploy. So they're forced to be creative about architecture. Not "how do we throw more resources at this?" but "how do we extract more intelligence from the resources we have?"
MHC is a direct expression of that constraint. It doesn't require more compute. It reorganizes existing compute to flow more effectively. And the principles transfer: any lab can adopt this approach, regardless of their GPU budget.
I see a parallel in my work at GE Aerospace. The best improvements in manufacturing operations rarely come from buying newer, more expensive machines. They come from redesigning the process, from changing how information and materials flow through the existing system. DeepSeek is doing the same thing with neural network architecture.
When I explain this to the executives in my courses, I frame it as a strategic question. The US is betting heavily on compute infrastructure (hundreds of billions in data center construction). China is betting on architectural efficiency. Both approaches produce frontier models today. But as we approach the physical and economic limits of scaling, the efficiency research becomes increasingly valuable. The lab that figures out how to get GPT-5 level performance from GPT-4 level compute has a structural advantage that no amount of GPU spending can match.
What Builders Should Take Away
MHC is not going to change your workflow tomorrow. It's a research paper, not a product. You won't see an "MHC-powered" label on the next Claude or GPT release.
But here's what it signals.
The transformer architecture, which has been remarkably stable since 2017, is now actively being redesigned. The changes are coming from Chinese labs (DeepSeek, ByteDance) that are publishing openly and iterating on foundational components Western labs have left untouched for years. And architectural improvements compound. ResNet unlocked a decade of progress. MHC (or techniques like it) could unlock the next phase.
If you're making decisions about AI infrastructure, this is worth factoring in. The models you're building on will get meaningfully better without requiring proportionally more compute. That changes the economics of every AI project. Training costs come down. Inference costs come down. And the performance ceiling goes up.
DeepSeek's V4 model is expected later this month. If it incorporates MHC into a production model and the results match the paper's claims, the conversation shifts from "interesting research" to "new standard." Other labs will adopt similar approaches within the year, just as Mixture of Experts spread after DeepSeek demonstrated it with V2 and V3.
The most consequential advances in AI right now aren't the flashy model releases. They're the architectural improvements to the invisible infrastructure underneath. DeepSeek just rewired the plumbing. Pay attention to what flows through it next.
