100 Agents, Zero Engineering

Moonshot's latest model, Kimi K2.5, orchestrates up to 100 agents in parallel and handles 1,500 tool calls without an application layer or framework on top. The swarm orchestration is trained directly into the model through reinforcement learning.

When I first saw the numbers, my reaction was skepticism. The AI industry has a clear division of labor: labs build the intelligence, applications figure out how to make it useful. OpenAI and Anthropic train models. Perplexity, Manus, and Claude Code add orchestration on top. Until now, nobody had trained orchestration into the model itself. Moonshot did, and the approach is surprisingly simple.

If you've been building with AI agents, you know the bottleneck. Models are smart enough. The hard part is making multiple agents coordinate reliably and finish faster than one agent running tasks sequentially. That's engineering work that takes months to get right. Moonshot trained a model to handle it natively through reinforcement learning, and the result is 4.5 times faster task completion compared to application-layer orchestration doing the same work. Because the coordination is learned rather than engineered, the cost scales with inference pricing rather than engineering headcount.

How Moonshot Got Here

Moonshot built toward this in three-month cycles, each release stacking a new capability. Kimi K2 launched in July 2025 with a Mixture of Experts architecture (384 experts, eight activated per token) and put Moonshot on the map as a frontier lab. K2 Thinking followed in November, adding interleaved reasoning with the stability to chain 200 to 300 tool calls without losing coherence. The model could now work through multi-step problems reliably, which set the foundation for what came next. Then K2.5 arrived in January with native multimodality through a 400 million parameter vision encoder, and the agent swarm.

K2.5 ranks fourth on the Artificial Analysis benchmark, behind GPT 5.2, Opus 4.5, and Gemini 3 Pro. For an open model, that's a strong position. It also took first place in the design arena (users upload video recordings of websites and the model replicates them from scratch), making it the first open model to beat closed models on design quality. But the swarm capability is what reshapes the architecture conversation.

How PARL Works

They call the approach PARL: Parallel Agent Reinforcement Learning.

Training starts with a simple incentive. Early in the process, the reward function includes a bonus (lambda set to 0.1) for parallelizing tasks. The model gets rewarded for breaking work into parallel streams, even when doing things sequentially would also produce a correct answer.

As training progresses, that bonus gradually drops to zero. By the end, the model is only rewarded for reaching the right answer efficiently. The model uses parallelism when it helps and skips it when the overhead outweighs the gain.
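The shape of that schedule can be sketched in a few lines. This is my reading of the mechanism, not Moonshot's actual reward function; only the lambda of 0.1 comes from the article, and the step-cost coefficient and linear decay are assumptions for illustration.

```python
# Sketch of a decaying parallelism bonus. Assumed terms: a unit reward
# for correctness, a small per-step cost, and a bonus (lambda = 0.1)
# for spawning parallel branches that fades out over training.

def parl_reward(correct: bool, steps_used: int, parallel_branches: int,
                train_progress: float, lam: float = 0.1) -> float:
    """train_progress runs from 0.0 (start) to 1.0 (end of training)."""
    base = 1.0 if correct else 0.0
    step_cost = 0.01 * steps_used            # assumed coefficient
    decay = max(0.0, 1.0 - train_progress)   # 1.0 early, 0.0 at the end
    bonus = lam * decay * (1.0 if parallel_branches > 1 else 0.0)
    return base - step_cost + bonus

# Early on, splitting the work pays even at equal step counts.
early_split = parl_reward(True, 10, 4, train_progress=0.0)
early_solo = parl_reward(True, 10, 1, train_progress=0.0)

# By the end, only correctness and efficiency matter.
late_split = parl_reward(True, 10, 4, train_progress=1.0)
late_solo = parl_reward(True, 10, 1, train_progress=1.0)
```

The bonus is scaffolding in exactly the delegation sense described below: it pays for the habit early, then disappears so efficiency takes over.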

When I explain this to the executives I teach, I use an analogy. Training PARL is like onboarding a new employee into delegation. At first, you actively encourage them to delegate tasks to their team, even when doing the work themselves might be faster. The goal is building the delegation muscle. Once they know how to coordinate, you remove the scaffolding and let them optimize on their own. Some tasks they'll delegate. Others they'll handle solo. The point is they learned the skill before they needed to be efficient at it.

Moonshot calls the failure mode serial collapse: the tendency for models to default to sequential processing because it's safer. Without the early parallelism bonus, models never learn coordination. The sequential route works fine, so why try something harder?

The mechanism that makes this practical is what they call critical steps. Consider a task requiring 70 steps total. The orchestrator spawns four agents. The work splits unevenly: one agent handles 40 steps, the other three handle 10 each. The parallel path costs 44 steps (the longest agent's 40, plus 4 for orchestrator overhead). The sequential path costs 71 (the 70 steps plus one orchestration step). Parallelism wins clearly.

Now consider a four-step task. Four agents each doing one step, plus four steps of orchestrator overhead, totals eight. Sequentially, five. The model learns to skip parallelism here.

Critical steps teaches the model to parallelize only when it shortens the total path. Coordination has a real cost, and the model learns to weigh it against the benefit.
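The arithmetic in both examples checks out under a simple cost model. This is my formalization, not Moonshot's: with independent branches, the parallel cost is the longest branch plus per-agent spawn-and-aggregate overhead, and the sequential cost is all the work plus one orchestration step.

```python
# Minimal critical-steps cost model (assumed, matching the article's
# two worked examples: overhead of one step per spawned agent, one
# orchestration step on the sequential path).

def parallel_cost(branch_steps: list[int], overhead: int = 1) -> int:
    # Branches run concurrently, so only the longest one is on the
    # critical path; overhead accrues per spawned agent.
    return max(branch_steps) + overhead * len(branch_steps)

def sequential_cost(total_steps: int) -> int:
    return total_steps + 1  # one orchestration step

# 70-step task split 40/10/10/10: parallelism wins, 44 vs 71.
assert parallel_cost([40, 10, 10, 10]) == 44
assert sequential_cost(70) == 71

# Four one-step tasks: four steps of work plus four of overhead is 8,
# against 5 sequentially, so skipping the swarm is the right call.
assert sum([1, 1, 1, 1]) + 4 == 8
assert sequential_cost(4) == 5
```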

This mirrors Mixture of Experts at a higher abstraction level. In MoE, a router selects which experts to activate per token. In PARL, an orchestrator selects which agents to spawn per task. Activate resources only when the problem justifies the coordination cost. Moonshot's founder, Yang Zhilin, has roots in this architectural thinking going back to XLNet, which improved upon BERT. The PARL design reflects that same instinct: start from how the model learns, then build the architecture around the learning process.

One clarification on where the swarm lives: the LLM doesn't execute agents directly. It generates output tokens describing the orchestration plan: how many sub-agents to create, what tasks to assign, how to aggregate results. The model plans the coordination, and the runtime infrastructure carries it out. This means any capable model could learn this through the same reinforcement learning approach. Moonshot proved the concept works.
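A toy version makes the division of labor concrete. Everything here is a hypothetical illustration: the JSON plan schema, the agent IDs, and the run_agent stub are my inventions, not Moonshot's actual format. The point is only that the model's contribution ends at the plan; a thin runtime does the spawning.

```python
# Hypothetical orchestration plan, as if decoded from model output
# tokens. The runtime parses it and fans the work out; the LLM itself
# never executes an agent.
import json
from concurrent.futures import ThreadPoolExecutor

plan_tokens = """
{
  "spawn": [
    {"id": "a1", "task": "update data models"},
    {"id": "a2", "task": "rewrite API endpoints"},
    {"id": "a3", "task": "update tests"}
  ],
  "aggregate": "merge results after all agents finish"
}
"""

def run_agent(spec: dict) -> str:
    # Stand-in for a real sub-agent call (model inference plus tools).
    return f"{spec['id']}: done ({spec['task']})"

plan = json.loads(plan_tokens)
with ThreadPoolExecutor(max_workers=len(plan["spawn"])) as pool:
    results = list(pool.map(run_agent, plan["spawn"]))
```

Swap the stub for real inference calls and the structure is the same: the plan is tokens, the swarm is infrastructure.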

What This Changes

I use Claude Code with Opus 4.6 every day. One model driving file navigation, code generation, testing, and debugging. The orchestration happens in the application layer, built and maintained by Anthropic's engineering team. It works well, and over the past year I've pushed it from simple code generation to multi-file refactors that touch dozens of files at once.

Kimi K2.5 points toward a different architecture. If the model itself learns to orchestrate agents, the application layer gets thinner. The coordination logic that teams currently build and maintain becomes a model capability instead.

Picture a codebase migration. A real one: dozens of interconnected files that need updating from one framework to another, with shared utilities, cross-file imports, and test suites that depend on all of it. A sequential agent works through them one at a time, carefully tracking dependencies, taking an hour. With native orchestration, the model analyzes the dependency graph first. It identifies which modules can be updated independently and spawns agents across them. One handles the data models, another rewrites the API endpoints, a third updates the tests. Where files reference each other, the orchestrator coordinates the handoffs. What took an hour finishes in twelve minutes. No orchestration framework to configure, no routing logic to debug. The model figured out the coordination and the critical path on its own.
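The dependency analysis in that scenario amounts to topologically grouping the file graph into waves that can be migrated in parallel. The file names and dependencies below are invented for illustration; the grouping uses Python's standard-library graphlib.

```python
# Group a (made-up) migration dependency graph into parallel waves:
# each wave holds files whose prerequisites are already migrated, so
# an orchestrator could hand a whole wave to separate agents at once.
from graphlib import TopologicalSorter

deps = {  # file -> files it depends on (hypothetical)
    "models.py": set(),
    "utils.py": set(),
    "api.py": {"models.py", "utils.py"},
    "tests.py": {"api.py", "models.py"},
}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # everything migratable right now
    waves.append(ready)
    ts.done(*ready)
# waves -> [["models.py", "utils.py"], ["api.py"], ["tests.py"]]
```

The first wave is where the swarm pays off; the later waves are the handoffs the orchestrator has to coordinate.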

I wrote recently about Minimax M2.5 and the falling cost of intelligence. The annual cost of running that model around the clock came out to $1,892. Kimi K2.5 pushes the same thesis from a different direction. Minimax proved inference can get 97% cheaper. Moonshot is proving that agency can move from the application layer into the model, removing another cost entirely: the engineering overhead of building orchestration systems. Between Minimax making inference nearly free and Moonshot making orchestration native, the cost barriers to running multi-agent systems are falling from both directions at once.

I teach executives about AI adoption, and the architecture question surfaces in every session. How do you build reliable multi-agent workflows? Right now the answer involves months of engineering: routing logic, error handling, state management across agents. If the model handles coordination natively, that barrier drops to choosing the right model. For organizations still figuring out how to deploy one AI tool reliably, the engineering barrier dropping to "pick the right model" is a significant shift. It also changes the hiring conversation. If orchestration becomes a model feature, the team you need to build multi-agent systems looks different than it did six months ago.

Markets already reflect this shift. Meta acquired Manus, an application layer built on other companies' models, for $2 billion, valued higher than some frontier labs doing original research. Moonshot sits at $4.8 billion and is positioning on both sides: frontier intelligence and native agency. When the weights go open source, any team can build on this approach without API dependencies or licensing constraints.

What I'm Watching

The boundary between model and agent is dissolving. Kimi K2.5 is an early signal, not the final form. The PARL approach is well documented and reproducible. Other labs will adopt it. When Moonshot publishes the weights, researchers can study how PARL shapes the model's orchestration decisions. Builders can fine-tune for their own workflows. The technique will likely appear in competing model families within the year, just as Mixture of Experts spread after DeepSeek demonstrated its viability at scale.

If model-native orchestration becomes standard in the next generation of frontier models, the orchestration frameworks you're building today may become model features tomorrow. The application layer that companies have spent years engineering, the routing, the state management, the coordination logic, could thin out rapidly once models handle it themselves. That's worth factoring into the architecture decisions you're making right now.

What would you build differently if your model could spawn its own agents?