The 10-Minute War

OpenAI didn't even give Anthropic ten minutes. Opus 4.6 launched, and before the tech press could finish their first draft of "Anthropic takes the lead," GPT-5.3 Codex was live. Two frontier models, two massive companies, released within minutes of each other.

I've been building with Claude Code and Opus every day for over a year. I watched this unfold in real time from both sides: as a daily user of Anthropic's model, and as someone who regularly evaluates OpenAI's releases for the courses I teach. My first reaction was exhaustion. My second was curiosity. Because when you strip away the marketing drama, these two releases tell very different stories about where AI is headed.

Caleb Wright's technical breakdown of both models pulled the numbers together clearly, and the picture that emerges is worth understanding. Not because you need to pick a winner, but because the competitive dynamics between these two companies are compressing what used to take years into weeks.

Context Windows: The Numbers That Actually Matter

One of the persistent frustrations developers have with large language models is context window size. How much information can the model hold in its working memory while it processes your request?

Google's Gemini offers 1 million tokens. OpenAI's GPT has been sitting at 400,000. Anthropic's previous model maxed out at 200,000. And while agentic applications like Claude Code manage context efficiently (summarizing, compacting, selectively loading files), 200,000 tokens was clearly behind the pack.
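The context management that agentic tools do can be sketched as a simple token-budget compactor. This is an illustrative simplification, not Claude Code's actual implementation, and the 4-characters-per-token estimate is a rough heuristic standing in for a real tokenizer:

```python
def compact(messages: list[str], budget_tokens: int,
            estimate=lambda s: len(s) // 4) -> list[str]:
    """Keep the most recent messages that fit the budget; stub out the rest.

    A real agent would replace the dropped prefix with an actual model-written
    summary; here a placeholder line marks where that summary would go.
    """
    kept, used = [], 0
    for msg in reversed(messages):  # walk backwards: newest messages first
        cost = estimate(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    dropped = len(messages) - len(kept)
    head = [f"[summary of {dropped} earlier messages]"] if dropped else []
    return head + list(reversed(kept))
```

The point of the sketch: a 200,000-token window forces this machinery to run constantly, while a 1 million token window lets far more of the session survive verbatim.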

You might think scaling the context window is a simple engineering problem. Add more memory, handle more tokens, ship it. In practice, there's a brutal tradeoff. As context windows grow, accuracy drops. The model starts forgetting, hallucinating, or confusing facts buried deep in the conversation.

There's a benchmark for this called MRCR (Multi-Round Co-reference Resolution). Think of it as a "needle in a haystack" test. You embed repeated facts throughout a long context and ask the model to correctly identify specific ones buried somewhere in the middle. The "eight needle" variant makes it harder: eight identical facts scattered through the window, forcing the model to track position, not just content.
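The eight-needle setup is easy to picture in code. Below is a toy harness in that spirit, not the real MRCR benchmark, just an illustration of why position tracking is the hard part: the scorer accepts an answer only if it points at the k-th copy of the needle, not merely any copy.

```python
import random

def build_haystack(needle: str, n_needles: int, filler_lines: int, seed: int = 0):
    """Scatter n identical needle lines among filler lines.

    Returns the full text plus the line indices where each needle landed,
    so a scorer can check WHICH copy the model identified.
    """
    rng = random.Random(seed)
    lines = [f"filler sentence number {i}." for i in range(filler_lines)]
    # Insert back-to-front so earlier insertion points stay valid.
    for pos in sorted(rng.sample(range(filler_lines), n_needles), reverse=True):
        lines.insert(pos, needle)
    needle_indices = [i for i, line in enumerate(lines) if line == needle]
    return "\n".join(lines), needle_indices

def score(answered_line: int, target_ordinal: int, needle_indices: list[int]) -> bool:
    """Correct only if the model pointed at the k-th copy, not just any copy."""
    return answered_line == needle_indices[target_ordinal]
```

With eight identical needles, content matching alone gets you nowhere; the model has to hold the ordering of occurrences across the whole window, which is exactly what degrades as context grows.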

At 1 million tokens, the previous best accuracy on this benchmark was 32.6%. That's one in three. Google's Gemini 3 scored around 25%. Not great when you're building applications that depend on the model remembering what you told it 500,000 tokens ago.

Then Anthropic dropped Opus 4.6 with a score of 76% at 1 million tokens.

That's not an incremental improvement. Anthropic jumped from 200,000 to 1 million tokens while more than doubling the accuracy at full context. For anyone building complex applications that require the model to reason across large codebases or long conversations, this is the number that changes what's possible.

I felt this in practice before I saw the benchmarks. Working with Opus 4.6 through Claude Code, the model holds coherence across longer sessions than its predecessor. When I'm doing multi-file refactors that touch twenty or thirty files, the model remembers the architectural decisions from the beginning of the session. With the previous Opus, I'd start noticing drift after an hour. Repeated instructions, lost context, the model re-discovering things I'd already explained. With 4.6, the drift window has expanded significantly.

GPT-5.3 Codex: Speed and Terminal Mastery

OpenAI took a different angle with GPT-5.3 Codex. They kept the context window at 400,000 tokens and focused on two things: speed and agentic capability.

First, inference speed: GPT-5.3 Codex is about 25% faster than the previous Codex release. One of the biggest complaints about the Codex line was latency. If you've used both Claude Code and Codex CLI side by side, you know the feeling. Opus thinks longer but produces more coherent output. Codex was faster on simple tasks but slower on complex ones, and the overall pace felt sluggish. The 25% bump addresses this directly.

Second, and more interesting to me: terminal navigation. Since Claude Code launched in February 2025 and Codex CLI followed in April, terminal-based agents have become the primary interface for how frontier models interact with codebases. The model lives in your terminal. It reads files, runs commands, executes tests, navigates directories. How well it handles that environment determines how useful it is for real work.

There's a benchmark for this too, called TerminalBench. Eighty-nine isolated Docker environments, each with a different task: building repositories, setting up servers, training models, debugging failing test suites. You drop the model into a container and see if it can solve its way out using only terminal commands.

GPT-5.2 Codex scored 64%. GPT-5.3 jumped to 77% (with the official record noting 75%, close enough). That's a meaningful leap in the model's ability to operate independently in the environment where developers actually use it.

OpenAI also mentioned that GPT-5.3 Codex was used to assist in building itself. The model contributed to its own training pipeline. This sounds dramatic, and it is. We're at the point where models are capable enough to contribute meaningfully to the research process that produces the next version. The recursive improvement loop isn't theoretical anymore. It's happening in production at OpenAI.

The Pace Is the Story

Step back from the individual benchmarks and look at the release cadence.

Anthropic's Opus line has been iterating every two to three months. Opus 4.5 in late 2025, Opus 4.6 in February 2026. Steady, methodical improvements.

OpenAI started with roughly four-month gaps between major releases, compressed to two months, and is now pushing toward one to two months between iterations. GPT-5.0 to 5.1 to 5.2 to 5.3, each cycle shorter than the last.

We could see monthly releases from both companies by summer. Possibly bi-weekly point releases by year end. Each release doesn't need to be a revolution. Small, compounding improvements every few weeks add up fast when the baseline is already frontier-level intelligence.

For anyone building on top of these models, this pace creates a specific kind of pressure. The model you build your product on today will be two generations old in four months. Your architecture needs to handle model swaps cleanly, or you'll spend more time upgrading than building.

I've restructured my own workflows around this reality. When I build tools with Claude Code, I abstract the model layer so I can swap between versions without rewriting the logic on top. It's an extra hour of work upfront that saves weeks when the next version drops and I want to take advantage of it immediately.
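Concretely, that abstraction can be as thin as a registry of adapters behind one function. Everything here is a sketch with hypothetical names: each `send` callable would wrap whichever vendor SDK you actually use, so no vendor code leaks above this layer.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

@dataclass
class CallableModel:
    """Adapter: wraps any provider call (a closure around a vendor SDK)."""
    name: str
    send: Callable[[str], str]

    def complete(self, prompt: str) -> str:
        return self.send(prompt)

_REGISTRY: dict[str, ChatModel] = {}

def register(name: str, model: ChatModel) -> None:
    _REGISTRY[name] = model

def complete(model_name: str, prompt: str) -> str:
    """Application code calls this, never a vendor SDK directly."""
    return _REGISTRY[model_name].complete(prompt)
```

When the next version drops, you register a new adapter under the same name and everything above this layer keeps working untouched.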

The Price Gap Nobody Talks About Enough

Here's where the competitive picture gets complicated.

GPT-5.3 Codex: $1.75 per million input tokens, $14 per million output tokens.

Opus 4.6: $5 per million input tokens, $25 per million output tokens.

Opus costs roughly three times more on input and nearly twice as much on output. And the word from developers using Opus 4.6 heavily is that the model is token-hungry. It generates longer, more detailed outputs. The 1 million token context window means sessions can accumulate significant token counts before you hit the end.
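The gap is easy to quantify from the published per-token prices. The session sizes below are made up but shaped like a long agentic coding session, where accumulated input dwarfs generated output:

```python
def session_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in USD; prices are per million tokens."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical heavy session: 2M input tokens accumulated, 150k generated.
codex = session_cost(2_000_000, 150_000, 1.75, 14.0)  # GPT-5.3 Codex -> $5.60
opus = session_cost(2_000_000, 150_000, 5.00, 25.0)   # Opus 4.6 -> $13.75
```

For this input-heavy shape, the same session costs roughly 2.5x more on Opus, and a token-hungry model widens that gap further by generating more output per task.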

On ChatGPT's $20 plan, most users get enough headroom for regular Codex usage. On Claude's equivalent plan, the 5-hour refresh window runs tight, especially with heavy coding sessions using Opus 4.6. I've hit the limit multiple times during extended refactors, and the forced pause breaks the flow.

This is where the competition gets interesting beyond benchmarks. Anthropic built the better context window and the more reliable model for complex reasoning. OpenAI built the faster, cheaper option that handles terminal work well. Neither is strictly better. The right choice depends on what you're building and how much you're willing to spend.

I've settled into a pattern that reflects this. Opus for architectural work, complex multi-file changes, and anything that needs deep context. OpenAI's models for quick iterations, isolated tasks, and high-volume workloads where cost matters more than depth. Two tools for different jobs, not a loyalty contest.

What I'd Tell My Students

I teach venture design at University of Naples and ESCP, and AI infrastructure decisions come up in every session now. Students ask me which model to build on. Executives ask me which company to bet on. The answer has become hard, and I think that difficulty is the most important signal of all.

Eighteen months ago, OpenAI was the clear default. Twelve months ago, Claude Code and Opus shifted the conversation. Six months ago, open-source models from DeepSeek and Minimax started closing the gap from below. Today, two frontier labs are releasing within minutes of each other, and the right answer depends on the specific use case.

This is what competition looks like when it's working. Anthropic pushes context quality, OpenAI pushes price and speed, and every builder benefits from the race. The worst outcome would be one company running away with it. Instead, we're watching two companies trade punches in real time, each forcing the other to improve faster.

If you're building with AI right now, here's the practical takeaway. Don't pick a side. Build for flexibility. Abstract your model layer. Test new releases as they drop. The ten-minute gap between these two launches tells you everything about the pace you need to match.

And if you're still evaluating whether to start building with AI coding agents at all, the window for "I'll wait until it stabilizes" is closing. The models are good enough now. Both of them. Pick either one and start building while the labs fight it out.

The race just got faster. Your move.