Introducing Klavis MCP SaaS (Sandbox-as-a-Service)!
Reinforcement Learning environments are becoming the new labeled datasets.
Just as curated data powered the last wave of AI breakthroughs, training environments are powering the next one: AI agents that can actually use tools.
But here's the problem:
Training models to use Gmail, Salesforce, Slack, or Jira requires things that are complex and painful to build yourself:
→ Managing hundreds of authenticated test accounts
→ Initializing realistic data states for each training episode
→ Resetting state between runs
→ Ensuring isolation across concurrent training sessions
→ Providing verifiable sandbox state for reward signals
Most research teams spend months on this infrastructure before a single training run.
That's why we are launching a managed MCP Sandbox-as-a-Service for RL training on tool use at Klavis AI.
One API call → isolated sandbox backed by real service instances.
The training loop becomes simple:
Initialize → Seed sandbox with custom data state
Interact → Model executes actions via MCP tools
Dump → Get final state snapshot
Compute reward → Compare dump vs. target state
Reset → Return to pristine state instantly
Deterministic. Reproducible. Parallelizable.
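Here is a minimal sketch of that loop in Python. It is not the Klavis API: `InMemorySandbox`, `call_tool`, `compute_reward`, `run_episode`, and `toy_policy` are hypothetical names, and an in-memory dict stands in for the real service instance. The reward is an exact match between the dump and the target state, which is just one simple choice.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class InMemorySandbox:
    """Hypothetical stand-in for one isolated sandbox (not the Klavis API)."""
    state: dict = field(default_factory=dict)

    def initialize(self, seed_state: dict) -> None:
        # Initialize: seed the sandbox with a custom data state.
        self.state = copy.deepcopy(seed_state)

    def call_tool(self, name: str, **args) -> None:
        # Interact: stand-in for an MCP tool call; only one toy tool exists.
        if name == "send_email":
            self.state.setdefault("sent", []).append(args)
        else:
            raise ValueError(f"unknown tool: {name}")

    def dump(self) -> dict:
        # Dump: return a snapshot of the final state.
        return copy.deepcopy(self.state)

    def reset(self) -> None:
        # Reset: return to a pristine (empty) state.
        self.state = {}


def compute_reward(dump: dict, target: dict) -> float:
    # Binary reward: 1.0 only if the final state equals the target state.
    return 1.0 if dump == target else 0.0


def run_episode(sandbox, seed_state, policy, target) -> float:
    sandbox.initialize(seed_state)
    for name, args in policy(seed_state):
        sandbox.call_tool(name, **args)
    reward = compute_reward(sandbox.dump(), target)
    sandbox.reset()
    return reward


def toy_policy(_state):
    # Fixed actions standing in for whatever the model would choose.
    return [("send_email", {"to": "a@example.com", "subject": "hi"})]
```

Because each episode owns its sandbox object, running many of them in parallel is a matter of creating one sandbox per worker.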
If you're training models on tool use, we'd love to chat.
Most AI agent benchmarks are a total waste of time.
They rely on static, low-fidelity mockups that don't exist in the real world. If you’re training agents on simplified environments, don't be surprised when they fail in production.
@AnthropicAI's latest guide on demystifying evals makes one thing clear: you are "flying blind" without rigorous, repeatable testing.
But here is the problem: Evals are only as good as the world they live in.
To build a truly generalist agent, you need more than a mockup. You need:
1> long-horizon workflows: Tasks that require dozens of steps, not just one-shot completions.
2> authentic ecosystems: Real software, real state, and real ambiguity.
3> noise and state-dependency: The "messiness" that breaks 90% of agents today.
This is why we built @Klavis_AI Universe.
We don't provide "test cases", we provide scalable universes.
By giving models ultra-realistic settings with 300+ MCP Servers, we enable the kind of RL training and long-horizon evaluation that was previously impossible.
Anthropic says the value of evals compounds over the life of an agent. We say the value of the environment is the ceiling of your agent's intelligence.
Stop testing against toy environments. Start building for the real world.
Most founders think Forward Deployed Engineering (FDE) is just a fancy term for technical support.
They’re wrong.
FDE isn't about fixing bugs for customers, it’s about collapsing the distance between a vision and a production-ready reality.
We spent a full day in a private shack in SF with the @DeepAI team.
No pitch meeting. No polished demo environment. Just system design, whiteboarding, and pair-coding.
We didn’t just talk about how to integrate @Klavis_AI. We went deep on the reality of AI agents in production, tooling infrastructure that actually scales, and solving integration blockers in real time.
The bandwidth of collaboration you get in 6 hours of coding beats 6 months of email threads.
One more thing I learned?
@KevinBaragona shared the story of how @DeepAI became the official sponsor of the Tuvalu National Futsal Team!
It’s these "fascinatingly weird" stories that emerge when you actually get in the trenches with other builders. You don't get that over a Zoom call.
If you aren't spending deep, focused time in the room with your partners, you aren't building, you’re just shipping features.
The future of AI isn't just about better models. It’s about better stories we build along the way.
Build with your users.
The #1 rule of early-stage sales is: "Listen more than you talk."
I failed this rule completely with my first customer.
I met Shoya (@pineforesta) at Equator Coffee shortly after we launched @Klavis_AI. I was nervous. I spent the entire conversation stumbling through a long, complicated pitch. I barely let him get a word in.
By all accounts, I should have lost the sale.
But Shoya didn’t walk away. He listened patiently.
When I finally stopped talking, he didn't ask for clarification. Instead, he did something incredible: in just a few simple sentences, he explained his product and exactly where Klavis fit into his workflow. He even opened his laptop to demo my value prop to me.
I learned two massive lessons that afternoon:
1/ True early adopters are rare. They don't need a perfect pitch. They see the vision through the mess and connect the dots themselves.
2/ Great founders simplify complexity. While I was complicating things, Shoya was clarifying them.
Shoya became our first paying customer that day.
He is still on our cheapest "grandfathered" pricing plan. Not just because he is our friend, but because he bet on us when I didn't even know how to sell yet.
Today, Shoya just got accepted into the next @ycombinator batch.
I’m not surprised. He knows how to find the signal through the noise better than anyone I know.
So proud of you, Shoya!
Cursor Head of Design Ryo Lu (@ryolu_) has spent his career at the intersection of design and engineering—from building fan sites as a kid to designing products at Stripe, Asana, and Notion. Now he's rethinking how software itself gets made.
On this episode of Design Review, Ryo joins YC's @aaron_epstein to break down how great product websites communicate what a company does. They walk through sites from early-stage startups, calling out the small choices in structure, clarity, and brand that help users understand a product instantly, and the ones that get in the way.
00:00 - Intro
01:00 - Crunched
05:30 - Velvet
09:00 - Klavis AI
14:30 - Code Crafters
20:40 - Slashy
22:50 - Freya
26:00 - Finta
30:30 - Vibeflow
Super grateful for @ryolu_ and @aaron_epstein diving deep on the Klavis website in the latest YC video!
That's the kind of feedback that actually shapes our products...
back to work now!
Managing complex permissions kills velocity...
Check out role-based access control (RBAC) at @Klavis_AI
- Organize your entire team effortlessly.
- Fine-grained control down to specific roles.
- Secure every connector individually.
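To make the idea concrete, here is a minimal sketch of per-connector role checks in Python. It is an illustration only, not the Klavis data model or API: `AccessPolicy`, `grant`, `assign`, and `can_use` are hypothetical names, and access is denied by default unless a user holds a role that the connector allows.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Hypothetical per-connector RBAC table (not the Klavis data model)."""
    # connector name -> roles allowed to use it
    allowed_roles: dict[str, set[str]] = field(default_factory=dict)
    # user name -> roles assigned to that user
    user_roles: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, connector: str, role: str) -> None:
        # Allow a role to use one specific connector.
        self.allowed_roles.setdefault(connector, set()).add(role)

    def assign(self, user: str, role: str) -> None:
        # Give a team member a role.
        self.user_roles.setdefault(user, set()).add(role)

    def can_use(self, user: str, connector: str) -> bool:
        # Deny by default: the user needs a role the connector allows.
        roles = self.user_roles.get(user, set())
        return bool(roles & self.allowed_roles.get(connector, set()))
```

For example, granting `"support"` on a `"slack"` connector and assigning that role to one user lets only that user through `can_use` for Slack, while every other connector stays closed to them.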
See it in action below. 👇
We've all been trained to think "bigger is better."
But the reality? Most models start degrading long before they reach their advertised context limit...
Around the 200k-token mark, "context rot" begins:
→ Slower inference
→ Degraded quality
→ Lost context
It's the hidden bottleneck that kills performance.
Stop focusing on the 1M marketing number. The real insight is finding the "pre-rot threshold," which for most models is 128k-200k tokens.
Use this number as your trigger for context reduction and management.
This is critical when your AI agent interacts with tools or MCP servers. Those interactions consume massive amounts of context and will push you into the "rot" zone faster than you think.
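A minimal sketch of using that threshold as a trigger, in Python. Everything here is an assumption for illustration: `PRE_ROT_THRESHOLD` is set to the low end of the 128k-200k range, `estimate_tokens` is a crude 4-characters-per-token heuristic rather than a real tokenizer, and dropping the oldest messages is only one of several reduction strategies (summarizing or truncating tool outputs are others).

```python
PRE_ROT_THRESHOLD = 128_000  # tokens; assumed low end of the 128k-200k range


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def trim_context(messages: list[str], threshold: int = PRE_ROT_THRESHOLD) -> list[str]:
    """Drop the oldest messages until the estimated total fits the threshold."""
    kept = list(messages)
    total = sum(estimate_tokens(m) for m in kept)
    # Always keep at least the most recent message.
    while total > threshold and len(kept) > 1:
        total -= estimate_tokens(kept.pop(0))
    return kept
```

Calling `trim_context` after every tool or MCP response keeps the check close to where the context actually grows.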