We believe the single biggest point of failure comes before any code is written. The planning layer solves this, and no other company has addressed it adequately. Claude's dynamic workflows may be a decomposition strategy similar to RLM, but that is still the execution layer. If you can estimate complexity and cost up front, you plug the biggest token leaks before they start.
How do you compete in the AI coding game against giants like Anthropic and OpenAI?
You don't outspend them. You out-engineer their harnesses.
"The harness matters. Especially when it comes to cost efficiency and token usage. If you can get both an intelligence gain and a cost gain, then that's really how you're chasing the edge in this coding game."
We break down how we just broke the model cost continuum.
1. On off-the-shelf SDKs:
"We hand-built this harness because we tried the Claude Code SDK and the Open Code SDK. They both had gaping leaks."
2. On picking a benchmark that actually mirrors real engineering:
"You have two types of benchmarks. Some just introduce isolated patches, but they aren't really multi-turn agent tasks. We picked Terminal-Bench because it tests multi-file, multi-turn execution right in the terminal."
3. On the results that broke the cost curve:
"Our Haiku scored higher than Sonnet 4.5 on the leaderboard. We did about 16% better than Claude Code on Haiku. That’s a three times cost difference, and we somehow jumped that."
4. On the core thesis of next-gen software engineering:
"The harness matters. Especially when it comes to cost efficiency and token usage. If you can get both an intelligence gain and a cost gain, then that’s really how you're chasing the edge in this coding game."
If you have been using Claude Code professionally, take a minute to read this.
We beat Opus with Sonnet by using the predev harness. Here is what it means for agentic coding:
Orchestration beats brute reasoning. A smaller model running on our architecture just beat Claude Opus on Terminal-Bench 2.0.
For the last two years, the default way to improve an AI coding agent has been simple: throw more money at the model. Upgrade the tier, burn more tokens, buy a bigger context window, and hope for better code.
But a model's raw weights matter far less than the system architecture wrapping it.
Seeing a clear ROI on every shipped feature has a direct impact on the success of your business. So does spending 50% more tokens per feature.
We wanted to prove that a smarter harness could break the trend of relying on brute compute. So we put it to the test on Terminal-Bench 2.0.
We ran predev + Sonnet 4.6 on the Harbor reference harness against the Terminal-Bench 2.0 task set. The Claude Code numbers were taken directly from their public submissions on tbench.
The final result: predev + Sonnet 4.6 scored a 56.2% pass rate. Claude Code running the massive Opus 4.5 scored 53.9%.
We dropped an entire model tier and finished ahead. Accuracy went up, while the per-task model bill went down.
We didn't close that performance gap by paying for a premium model; we did it by spending tokens more efficiently. Buying a bigger model just trades dollars for points. A better harness gets the points without making that trade.
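To make the "dollars for points" trade concrete, here is a minimal Python sketch. The two pass rates are the ones reported above; the per-attempt costs are made-up placeholder units, not measured figures, and `cost_per_pass` is a hypothetical helper rather than anything from the predev codebase.

```python
def cost_per_pass(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing task at a given pass rate."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# Pass rates reported in the post.
PREDEV_SONNET = 0.562
CLAUDE_CODE_OPUS = 0.539

# Gap between the two runs, in percentage points.
margin_points = round((PREDEV_SONNET - CLAUDE_CODE_OPUS) * 100, 1)

# ASSUMPTION: placeholder per-attempt costs in arbitrary units.
# Swap in your own measured per-task bills before drawing conclusions.
smaller_model = cost_per_pass(1.0, PREDEV_SONNET)
larger_model = cost_per_pass(2.0, CLAUDE_CODE_OPUS)

print(f"margin: {margin_points} points")
print(f"cost per passing task: {smaller_model:.2f} vs {larger_model:.2f} units")
```

The useful number is cost per passing task, not cost per attempt: a cheaper run that also passes more often wins on both terms of the ratio.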
Frontier labs build mass-market engines. They cannot specialize their layer deeply for complex engineering tasks. It is exactly like a database engine: the engine doesn't tell you how to organize your application data.
We built our harness narrowly and explicitly to solve complex, production-ready systems engineering. The core architecture rests on three uncompromising loops:
It plans before it codes. The agent extracts a structured blueprint with milestones before a single file is touched.
It uses dynamic execution paths, leveraging ToDo dependency graphs and parallel analysis.
It verifies before passing. A blind verifier re-runs acceptance criteria against the output and disagrees freely.
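The three loops above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the predev implementation: the task names, `parallel_batches`, and `blind_verify` are all hypothetical, and the sketch uses the standard library's `graphlib.TopologicalSorter` to turn a ToDo dependency graph into batches that could run in parallel.

```python
from graphlib import TopologicalSorter
from typing import Callable


def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches whose dependencies are already complete."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the plan has a circular dependency
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


def blind_verify(output: str, criteria: list[Callable[[str], bool]]) -> bool:
    """Judge the output against acceptance criteria only, ignoring any
    success claim made by the agent that produced it."""
    return all(check(output) for check in criteria)


# Hypothetical plan: each task maps to the tasks it depends on.
todo = {
    "write_schema": set(),
    "write_api": {"write_schema"},
    "write_tests": {"write_schema"},
    "integrate": {"write_api", "write_tests"},
}
batches = parallel_batches(todo)
```

Here `write_api` and `write_tests` land in the same batch because both depend only on `write_schema`, which is what makes parallel execution safe.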
You don't need to be a frontier lab to beat one. You just need a system engineered for the actual work.
We built predev for real, live use cases and customers.
predev customers include software development agencies building products for clients, software vendors building proofs-of-concept for their prospects, enterprise teams building internal data pipelines, and startups getting their MVP out on time and on budget.
If ROI matters to your business, stop burning tokens on brute force. Head over to predev, run your project through our harness, whether it is an existing codebase or greenfield, and watch us ship.
The open-sourced Harbor trajectories and repository are live on GitHub. The full breakdown of the results is live on our website.
(Links to both in the comments).
Coding with AI feels a lot like poker: I have to wager tokens with high variance on whether the solution is legitimate, while balancing the possibility that the opponent (the agent) is bluffing me. I have imperfect information because I can't account for every single bash command and line of code, and I have to make live reads on whether the agent's information is accurate.
The Artificial Intelli-Gents Ep. 7: We Beat Claude Opus with a Smaller Model
We did it: We dropped an entire model tier and still finished ahead. By proving that orchestration beats brute reasoning, our predev harness paired with Sonnet 4.6 (56.2%) officially beat Claude Code running Opus 4.5 (53.9%) on the grueling Terminal-Bench 2.0. The best part? We achieved higher accuracy while significantly cutting our per-task token bill.
Also in episode 7 of The Artificial Intelli-Gents, we dive into the current state of RLM, why "harness engineering" is the next frontier, and how to leverage forward-deployed engineers.
Timestamps:
2:28 - Why Terminal-Bench
9:33 - How we did it & what the results mean
18:05 - What makes our harness unique
27:00 - Building our own Browser Agents
34:35 - Implementing Recursive Language Models
41:20 - The future of benchmarks
51:00 - Forward Deployed Engineers & Dev Shops
1:04:00 - The harness of harnesses
1:17:37 - Upcoming Releases
@adampredev @ArjunRajJain
Uber spent its entire annual AI budget in one quarter. The creator of Openclaw burns $1M a month on Codex tokens.
What is the true ROI on AI token spend? Here is how we measure and increase it.
When Fortune 100 CFOs question tokenmaxxing and Microsoft cancels Claude subscriptions, it points to an underlying issue.
Our engineering patterns are evolving for raw output, not efficiency: the illusion of free code.
Because a developer doesn’t see a cash meter running while typing prompts, writing code suddenly feels free.
So we run three agents in parallel just to pick the fastest output, or throw Opus at a basic typo because it's easier than switching models.
No one wants to touch code they didn't write, so they just use AI. It becomes a compounding cycle.
But when code feels free, foundational engineering discipline erodes.
Vague, one-sentence prompts replace rigorous user stories. The ergonomics have changed; no one wants to jump into Jira anymore.
They go straight to the terminal, letting the agent make an educated guess. If it's off, they just start over, because it's faster than writing extensive briefs.
We end up with an explosion of output: a mountain of code generated by a mountain of tokens. But how much actually makes it to production?
Did we build an overly complex solution to a simple problem just because compute allowed it? We accelerated output, but didn't increase productivity.
We ran a dev shop for years. When clients pay for shipped product rather than raw hours, ROI is your entire business.
Scarcity breeds process innovation. True efficiency doesn't happen while the agent is typing. It happens before a single line of code is written.
We baked those exact operational mechanics directly into predev. We pre-dev before we build.
It enforces planning. Tasks are broken down and estimated directly inside your workflow, syncing with Jira to eliminate context switching.
It selects the right tool. With proper briefing, the harness chooses the optimal model tier. For a basic button component, you don't need premium reasoning.
It leverages recursive long-term memory. Like a veteran engineer who knows a codebase's quirks, predev builds a compounding memory of common bugs, libraries, and tooling issues with every task.
It blind-verifies everything. It runs strict acceptance criteria against the output to kill false positives before they ever pollute your repository.
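As a rough illustration of estimating up front and then picking a tier, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Subtask` fields, the scoring rule, the thresholds, and the tier names are hypothetical and do not describe how predev actually scores or routes work.

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative thresholds and tier names, not predev's real ones.
TIERS = [(3, "small"), (7, "medium")]
TOP_TIER = "large"


@dataclass
class Subtask:
    title: str
    files_touched: int
    has_tests: bool


def estimate_complexity(task: Subtask) -> int:
    """Toy estimate: more files and missing tests both raise the score."""
    score = task.files_touched
    if not task.has_tests:
        score += 2
    return score


def choose_tier(task: Subtask) -> str:
    """Return the cheapest tier whose threshold covers the estimate."""
    score = estimate_complexity(task)
    for limit, tier in TIERS:
        if score <= limit:
            return tier
    return TOP_TIER


button = Subtask("Add a basic button component", files_touched=1, has_tests=True)
migration = Subtask("Migrate the billing pipeline", files_touched=12, has_tests=False)
```

With these made-up thresholds, the button lands on the cheapest tier and the migration is routed to the top one, so the premium model is only paid for where the estimate calls for it.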
The result is total visibility over engineering margins. Every credit spent is traceable down to an exact user story, subtask, and pull request.
You can finally make intelligent spend decisions based on actual business value, rather than letting unconstrained agents run wild on your corporate credit card.
The point is you finally have control. You can make a calculated spend decision based on the exact value of a task, all while remaining highly efficient.
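One way to picture that traceability is a spend ledger keyed by user story, subtask, and pull request. The Python sketch below is a hypothetical data model for illustration only: `SpendEntry`, `credits_by`, and the sample identifiers and credit amounts are invented, not taken from predev.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpendEntry:
    user_story: str
    subtask: str
    pull_request: str
    credits: float


def credits_by(entries: list[SpendEntry], field: str) -> dict[str, float]:
    """Total credits grouped by one ledger field, e.g. 'user_story'."""
    totals: dict[str, float] = defaultdict(float)
    for entry in entries:
        totals[getattr(entry, field)] += entry.credits
    return dict(totals)


# ASSUMPTION: sample entries with invented identifiers and amounts.
ledger = [
    SpendEntry("STORY-1", "schema", "PR-10", 1.5),
    SpendEntry("STORY-1", "api", "PR-11", 2.5),
    SpendEntry("STORY-2", "button", "PR-12", 0.5),
]
per_story = credits_by(ledger, "user_story")
per_pr = credits_by(ledger, "pull_request")
```

Grouping the same entries by different fields is what lets a spend question be asked at the story level, the subtask level, or the pull-request level without keeping separate books.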
If you are ready to optimize your engineering ROI, head over to predev to book a demo.
Can a custom software harness make a low-tier model outperform a premium frontier model that costs three times as much?
Most enterprise teams are facing skyrocketing token bills because they think raw capital is the only path to intelligence.
On the latest episode of the Artificial Intelli-Gents Podcast, predev co-founders Adam and Arjun break down exactly why the industry has hit a model cost wall.
Instead of waiting for updates from foundational labs, Adam Elkassas (@adampredev) and Arjun Raj Jain (@ArjunRajJain) hand-built a native cloud harness from scratch to fix the severe memory leaks and structural shortcuts found in standard SDKs.
They ran their system against Terminal-Bench, the most rigorous multi-file coding benchmark in the industry. The results broke the standard model cost continuum:
Running on a low-cost tier like Claude Sonnet, predev’s native harness outperformed Anthropic’s own Claude Code running on premium Opus 4.5.
How do you manufacture that kind of lopsided intelligence gain while cutting token costs by two-thirds?
Here is a breakdown of the episode and how they out-architected foundational labs valued at hundreds of billions:
- Why Terminal-Bench: The truth about why standard benchmarks fail to test true, multi-turn agent capability in the wild.
- Breaking the Cost Continuum: The exact mechanics of how predev helps users maximize intelligence per token without breaking the bank.
- What Makes Our Harness Unique: Moving past simple for-loops into long-horizon planning layers and long-term execution graphs.
- Building Custom Browser Agents: How Arjun built a specialized browser agent layer that runs at 3x the speed and 1/3 the cost of market alternatives.
- Implementing Production RLM: The blueprint behind being the first team to truly implement Recursive Language Models to achieve unlimited execution depth.
- The Reality of Enterprise AI: Why raw agents fail out of the box in production, and how forward-deployed engineers scale MVPs to enterprise security standards.
- Upcoming Releases: A sneak peek into predev's next-gen CLI release, local-to-cloud syncing, and multi-session isolated sandboxes.
While frontier labs pour tens of billions into physical data centers, the real software alpha is being captured at the orchestration layer.
If you want to see how architectural execution beats raw compute capital, this episode is your blueprint.
Full episode in the quoted post below.
Episode 7 of the Intelli-Gents Podcast is out now, with @ArjunRajJain @predotdev
We cover TBench 2.0, harness engineering, coding agent self-improvement, RLM, and more!
https://t.co/wMHzFDgYQZ
@ArjunRajJain and I recorded a 90-minute episode of the Intelli-Gents. We are releasing it on YouTube on Monday.
We touch on topics such as coding benchmarks, RLM, self-improvement, harness engineering, and more!
As a human expert dev with 15+ years of coding experience, I can still drive Opus 4.8 on Claude Code in a manual Ralph loop for a full day on a single feature without success. This is what non-coders don't understand right now.
There are many days where you can code for a full day without moving the needle, even with perfect prompting and human assistance.
This is a mismatch in expectations that vibe coders will have to learn the hard way. The truth is that getting one decent feature working perfectly in prod means iterating heavily: scrapping, rewriting, and rethinking multiple times over.