I build the part of AI products that nobody sees until it breaks.
The retrieval layer that decides what context the model gets.
The agent control layer that decides when the system acts, waits, asks, retries, or stops.
The eval layer that catches regressions before users do.
The backend layer that turns a prototype into an API someone can actually depend on.
The cost layer that prevents “AI adoption” from becoming an invoice problem.
The observability layer that explains why the system behaved the way it did.
That is Eleventh.
Not an AI agency selling prompt wrappers.
Not consultants selling transformation decks.
Not demo builders chasing the newest model release.
A technical company focused on production AI infrastructure:
RAG.
Agents.
Evals.
Backends.
Retrieval.
Control planes.
Deployment-ready AI systems.
Most teams already have the demo.
What they don’t have is the infrastructure that makes it reliable enough to use with real customers, real data, real permissions, real latency constraints, and real failure modes.
That is the gap I close.
Because the model is not the product.
The system is.
AI infrastructure, not AI theater.
If you’re a founder or engineer turning an AI prototype into something customers can actually rely on, reply with the one thing that’s hurting you the most right now (RAG quality? agent reliability? eval drift? cost explosions? observability hell?).
Curious to hear what you’re wrestling with 👇
https://t.co/6ltcoodcKG
Why do multi-agent retrieval systems pass local tests but drop answers in prod?
It's not the agents. It's the retriever's chunking - too coarse, no overlap.
It's not the routing. It's having no observability on which agent got the query vs. which one *should* have.
It's not the model - it's the lack of three signals:
1) Query-to-chunk match score *per agent* (not just top-1)
2) Agent selection confidence *before* dispatch (not after)
3) Retrieval latency *per chunk source*, not just per agent
Without those, consistency is luck. Not design.
Add them. Then watch recall stop drifting.
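A rough sketch of what capturing those three signals per query could look like. Python, stdlib only. `DispatchTrace`, its method names, the agent names, and the scores are all made up for illustration, not taken from any framework.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DispatchTrace:
    """One record per query holding the three signals (hypothetical shape)."""
    query: str
    # 1) match scores for every retrieved chunk, per agent (not just top-1)
    match_scores: dict = field(default_factory=dict)
    # 2) router confidence per agent, recorded before dispatch
    selection_confidence: dict = field(default_factory=dict)
    # 3) retrieval latency in seconds, per chunk source
    source_latency: dict = field(default_factory=dict)

    def record_scores(self, agent, scores):
        self.match_scores[agent] = sorted(scores, reverse=True)

    def record_confidence(self, agent, confidence):
        self.selection_confidence[agent] = confidence

    def timed_fetch(self, source, fetch):
        """Run fetch() and store how long this chunk source took."""
        start = time.perf_counter()
        result = fetch()
        elapsed = time.perf_counter() - start
        self.source_latency.setdefault(source, []).append(elapsed)
        return result

    def chosen_agent(self):
        return max(self.selection_confidence, key=self.selection_confidence.get)

    def routing_mismatch(self):
        """True when the routed agent is not the one whose chunks matched best."""
        best = max(
            self.match_scores,
            key=lambda agent: max(self.match_scores[agent], default=0.0),
        )
        return self.chosen_agent() != best


trace = DispatchTrace(query="refund policy for annual plans")
trace.record_confidence("billing_agent", 0.55)
trace.record_confidence("docs_agent", 0.45)
trace.record_scores("billing_agent", [0.41, 0.39])
trace.record_scores("docs_agent", [0.83, 0.62])
chunks = trace.timed_fetch("wiki", lambda: ["chunk-a", "chunk-b"])

if trace.routing_mismatch():
    print("routed to", trace.chosen_agent(), "but another agent matched better")
```

The point of `routing_mismatch` is the "got the query vs. should have" comparison: it only becomes possible once scores are kept for every agent, not just the one that won.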
Why does your autonomous publishing system publish garbage at 3am?
It's not the model. It's the feedback loop.
My system published a broken version of a blog post because the "is this live-ready?" check passed - but only because the validator was comparing against stale cached embeddings. Same model, same logic, different cache state.
Fixed it with two things:
- A versioned content hash stamped into every embedding payload.
- A retry guard that refuses to re-run the validator unless the hash changes.
No more "it worked yesterday" surprises. The system now knows when its own memory is out of date.
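Here's a minimal sketch of those two fixes, not the actual production code. Assumptions: the payload is a plain dict, `check` is a hypothetical callable standing in for the real "is this live-ready?" logic, and the retry guard is read as "don't re-run a failed validation until the content hash is different".

```python
import hashlib


class StaleEmbeddingError(Exception):
    """Raised when an embedding payload was built from older content."""


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_payload(text, vector):
    # Stamp the hash of the exact content that produced this embedding.
    return {"content_hash": content_hash(text), "vector": vector}


class Validator:
    def __init__(self, check):
        self.check = check            # hypothetical: check(text, vector) -> bool
        self.last_failed_hash = None  # hash of the content that last failed

    def validate(self, text, payload):
        current = content_hash(text)
        # Refuse to validate against memory that is out of date.
        if payload["content_hash"] != current:
            raise StaleEmbeddingError("embedding was built from older content")
        # Retry guard: same content as the last failure, so don't re-run.
        if current == self.last_failed_hash:
            return False
        ok = self.check(text, payload["vector"])
        self.last_failed_hash = None if ok else current
        return ok


calls = []


def toy_check(text, vector):
    calls.append(text)
    return "TODO" not in text


validator = Validator(toy_check)
draft = "TODO: write intro"
draft_payload = make_payload(draft, [0.1, 0.2])
first = validator.validate(draft, draft_payload)   # runs the check, fails
second = validator.validate(draft, draft_payload)  # guard: check not re-run
```

With this shape a stale cache can't produce a pass: the validator raises before the check ever runs.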
Monitoring isn't about dashboards. It's about building the system to *refuse* wrong answers - not just flag them.
If your validator passes on stale data, it's not broken. It's *designed* to lie.
@SystemSunday the real lesson isn't the hour, it's not setting a target you already plan to betray. but keeping a promise that costs nothing is easy. discipline is when the alarm is genuinely early and you still get up. low bar kept still beats high bar broken
@matt_gray_ care is a moat only if customers can feel it in the product, not in the marketing. plenty of companies that "care" get crushed by ones that just ship faster and price lower. care buys you loyalty at the margins, it doesn't save you from being out-executed
@Nithya_Shrii sometimes true, but i'd be careful with this one. people also gang up on the genuinely toxic. being opposed isn't proof you're right, it's just proof you're noticed. strength is when the criticism is fair and you still hold your ground
@ash_twtz distribution and defaults. chrome ships on every android, gets pushed by google search, and most people never question the browser they're handed. better tech rarely wins, the one in front of your face does. brave's privacy pitch only lands with people already looking for it
@razvanfotia order matters though. skills first, always. a brand built on top of real skill compounds, a brand built before it is just borrowed time until someone asks you to actually do the thing. i've seen too many big followings quietly panic when the work shows up
@IAmAaronWill true but the overhead doesn't vanish, it moves. you trade managing 3 VAs for maintaining n8n flows that silently break at 2am and a claude bill that creeps. way better tradeoff for sure, just not free. the new skill is being the person who can debug the automation, not run from it
Your RAG evals pass in dev but fail in prod.
It's not the model. It's chunking + version skew.
We inject synthetic version shifts into test chunks - same docs, different embeddings - and assert recall stays flat.
If recall drops, the chunking strategy can't survive real-world doc updates.
GPT-as-judge won't catch it. A vector-distance assert will.
Fix the chunking *before* you tune the retriever.
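A toy version of that test, as a sketch only. Assumptions: the "version shift" is simulated by adding seeded Gaussian noise to the chunk vectors (a real test would re-embed with the new model or doc version), and the vectors, gold labels, `scale`, and `MAX_GOLD_DISTANCE` are invented numbers.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def shift(chunks, scale, seed):
    """Same docs, different embeddings: perturb every chunk vector."""
    rng = random.Random(seed)
    return {
        cid: [x + rng.gauss(0, scale) for x in vec]
        for cid, vec in chunks.items()
    }


def recall_at_k(queries, chunks, gold, k):
    hits = 0
    for qid, qvec in queries.items():
        ranked = sorted(
            chunks, key=lambda cid: cosine(qvec, chunks[cid]), reverse=True
        )
        hits += gold[qid] in ranked[:k]
    return hits / len(queries)


CHUNKS = {"c0": [1.0, 0.0, 0.0, 0.0], "c1": [0.0, 1.0, 0.0, 0.0], "c2": [0.0, 0.0, 1.0, 0.0]}
QUERIES = {"q0": [0.9, 0.1, 0.0, 0.0], "q1": [0.1, 0.9, 0.0, 0.0], "q2": [0.0, 0.1, 0.9, 0.0]}
GOLD = {"q0": "c0", "q1": "c1", "q2": "c2"}
MAX_GOLD_DISTANCE = 0.2  # assumed budget for cosine distance to the gold chunk

baseline = recall_at_k(QUERIES, CHUNKS, GOLD, k=1)
shifted_chunks = shift(CHUNKS, scale=0.05, seed=7)
shifted = recall_at_k(QUERIES, shifted_chunks, GOLD, k=1)

# Recall must stay flat across the shift.
assert shifted >= baseline, f"recall dropped: {baseline} -> {shifted}"

# The vector-distance assert: each query stays close to its gold chunk.
for qid, qvec in QUERIES.items():
    distance = 1 - cosine(qvec, shifted_chunks[GOLD[qid]])
    assert distance <= MAX_GOLD_DISTANCE, f"{qid} drifted: {distance:.3f}"
```

The distance check is the part an LLM judge has no view into: it fails on geometry, before the answer text ever looks wrong.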
Put the idempotency key on the write path before you ship the first retry.
Not "after you see duplicates."
Not "when you scale."
*Before you ship the first retry.*
You're not blocked by "not knowing how."
You're blocked by shipping the write path *before* the key.
The fix isn't clever.
It's adding one field.
Validating it.
Rejecting on duplicate.
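Those three steps, sketched with Python's built-in `sqlite3`. The `charges` table, its columns, and `create_charge` are hypothetical names for illustration; the only real mechanism here is a uniqueness constraint on the key column.

```python
import sqlite3


class DuplicateRequest(Exception):
    """Raised when a write reuses an idempotency key that was already stored."""


def open_db():
    db = sqlite3.connect(":memory:")
    # The one added field: idempotency_key, unique per logical write.
    db.execute(
        "CREATE TABLE charges ("
        " idempotency_key TEXT NOT NULL PRIMARY KEY,"
        " customer_id TEXT NOT NULL,"
        " amount_cents INTEGER NOT NULL)"
    )
    return db


def create_charge(db, idempotency_key, customer_id, amount_cents):
    # Validate it: a missing or blank key never reaches the write.
    if not isinstance(idempotency_key, str) or not idempotency_key.strip():
        raise ValueError("idempotency_key is required")
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO charges VALUES (?, ?, ?)",
                (idempotency_key, customer_id, amount_cents),
            )
    except sqlite3.IntegrityError as exc:
        # Reject on duplicate: the database enforces it, not the caller.
        raise DuplicateRequest(idempotency_key) from exc


db = open_db()
create_charge(db, "job-42:abc123", "cust_1", 1999)
try:
    create_charge(db, "job-42:abc123", "cust_1", 1999)  # the retry
except DuplicateRequest:
    print("retry rejected, customer charged once")
```

Letting the constraint do the rejecting means two concurrent retries can't both pass a "does it exist yet?" check.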
@Atomikos Exactly. The "duplicate" is just the visible symptom. The real failure is earlier - when two calls can't agree on what they already locked. Idempotency starts with a stable boundary, not a retry.
Shipped an agent that double-charged customers at 3am.
No idempotency key on the claim-next-job write path.
Retries without it = duplicates.
Idempotency without retries = still safe.
Add the key *before* the write. Every time.
If you're not deriving it from the job ID + a payload hash, you're guessing.
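One way to derive that key, as a sketch. Assumptions: the job has a string ID and a JSON-serializable payload; `idempotency_key`, `job-42`, and the payload fields are invented for the example.

```python
import hashlib
import json


def idempotency_key(job_id, payload):
    """Same job + same payload -> same key, regardless of dict key order."""
    canonical = json.dumps(
        [job_id, payload], sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_try = idempotency_key("job-42", {"customer": "cust_1", "amount_cents": 1999})
retry = idempotency_key("job-42", {"amount_cents": 1999, "customer": "cust_1"})
other = idempotency_key("job-42", {"customer": "cust_1", "amount_cents": 2999})

print(first_try == retry)  # the retry maps to the same key
print(first_try == other)  # a different payload gets a different key
```

Because the key is computed from the work itself, every retry of the same job arrives with the same key, with no random IDs or timestamps to lose track of.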