ChuksForge

@ChuksForge

AI Systems Engineer & LLM Reliability Specialist • Founder @ ChuksForge AI Solutions Ltd | Building AI that turns data and operations into revenue at scale

Efficiency Optimization

Joined April 2022

54 Following

196 Followers

286 Posts

Pinned Tweet

ChuksForge @ChuksForge

about 1 month ago

Most AI startups are profitable on paper, and losing money on inference. The failure mode is almost always delayed visibility. You launch. Usage grows. Revenue looks healthy. But nobody really knows: cost per request cost per feature cost per customer So decisions get made on intuition: “we’ll optimize later” “margin improves at scale” “this endpoint can’t be that expensive” Then one feature quietly consumes 60% of your tokens. One customer segment runs at negative margin. Pricing was modeled on early, lightweight usage — not production behavior. The feedback loop looks like this: usage → cost → delayed visibility → reaction By the time you react: the architecture is baked in customers are trained on the wrong pricing retrofitting becomes expensive You can’t retrofit healthy AI economics. You can only detect them early enough to change course. The delay is the problem. Not the cost. #AIEngineering #AIStartups #AIInfrastructure #UnitEconomics

ChuksForge's tweet photo. Most AI startups are profitable on paper, and losing money on inference.

The failure mode is almost always delayed visibility.

You launch. Usage grows. Revenue looks healthy.

But nobody really knows:

cost per request

cost per feature

cost per customer

So decisions get made on intuition: “we’ll optimize later” “margin improves at scale” “this endpoint can’t be that expensive”

Then one feature quietly consumes 60% of your tokens.

One customer segment runs at negative margin.

Pricing was modeled on early, lightweight usage — not production behavior.

The feedback loop looks like this:

usage → cost → delayed visibility → reaction

By the time you react:

the architecture is baked in

customers are trained on the wrong pricing

retrofitting becomes expensive

You can’t retrofit healthy AI economics.

You can only detect them early enough to change course.

The delay is the problem. Not the cost.

#AIEngineering #AIStartups #AIInfrastructure #UnitEconomics

ChuksForge @ChuksForge

about 6 hours ago

@SemudaraAbayomi I’m interested.

ChuksForge @ChuksForge

14 days ago

Most AI failures don’t start at generation. They start upstream: → retrieval → ranking → routing → context construction Generation gets blamed because it’s the visible layer. But after debugging production systems, a lot of “hallucinations” are actually information architecture failures. Usually one of these broke first: • wrong chunks retrieved • relevant context buried in noise • ambiguous state compressed into prompts • weak orchestration propagating uncertainty downstream In many cases, the model is behaving rationally. It’s responding to incomplete state, weak evidence, and noisy context. And this compounds fast in multi-agent systems: → retries increase → token usage inflates → downstream agents inherit uncertainty → reliability degrades step by step The teams building reliable AI systems in 2026 won’t win on prompting alone. They’ll win on: → retrieval precision → disciplined state management → controlled uncertainty propagation → context quality as a metric Prompt engineering matters But in complex AI systems, context engineering matters more. If outputs aren’t trustworthy, the fix may be better information architecture upstream of the model, not a bigger model. That’s the layer I build for.

119

ChuksForge @ChuksForge

20 days ago

@hackSultan Real!

Who to follow

0xVerse

@Verse_0x

All Upcoming New #web3 , #metaverse, #digital_assets based projects | #IDO, #IGO, #INO News. #Metaverse #NFT #GameFi #Web3

Mingshi S.

@NewMingshiS

Head of DeFi @StartaleGroup, Head of Product & Strategy @AstarNetwork, leave of absence @Berkeley_EECS

Fizen

@fizenapp

Crypto neobank for travel & QR payments. Backed by Tether. 👉Card: https://t.co/eIWEMpAlIG | 📲 App: https://t.co/Mc7dQBcTkP

ChuksForge retweeted

Name cannot be blank

@hackSultan

21 days ago

I believe some of these twitter VCs are just out to checkout ideas and not actually investing. Before you send that deck, check if they’ve invested in up to 5 startups in the past 6 months.

312

27K

ChuksForge retweeted

Pavel Durov

@durov

21 days ago

🤖 AI devs asked for this — and we delivered. 💬 Bots can now talk to other bots on Telegram. 🧠 Autonomous agents now have a communication layer humans can follow.

614

611

986

507K

ChuksForge @ChuksForge

25 days ago

@elite_developer Good one👏

ChuksForge @ChuksForge

25 days ago

We cut token spend by 38% without changing models or prompts. The problem wasn’t inference. It was orchestration. In production multi-agent systems, token waste hides in: - retry loops - failed tool calls - planner over-generation - rebuilding context every hop - fallback chains firing unnecessarily Most teams only monitor API spend. So retries look like “reliability” instead of architecture debt. The breakthrough came from workflow-level observability: per-hop token tracking + failure classification. The waste became obvious immediately. Big AI cost reductions often come from better orchestration, not better models. If you're only tracking provider costs, you're probably measuring the wrong thing. #AIEngineering #AIAgents #PromptEngineering

ChuksForge's tweet photo. We cut token spend by 38% without changing models or prompts.

The problem wasn’t inference.

It was orchestration.

In production multi-agent systems, token waste hides in:

- retry loops
- failed tool calls
- planner over-generation
- rebuilding context every hop
- fallback chains firing unnecessarily

Most teams only monitor API spend.

So retries look like “reliability” instead of architecture debt.

The breakthrough came from workflow-level observability:
per-hop token tracking + failure classification.

The waste became obvious immediately.

Big AI cost reductions often come from better orchestration, not better models.

If you're only tracking provider costs, you're probably measuring the wrong thing.

#AIEngineering #AIAgents #PromptEngineering

ChuksForge @ChuksForge

27 days ago

Most AI eval pipelines fail for the same reason most dashboards fail: They measure outputs, not decisions. We learned this building a multi-agent pipeline. Our retrieval agent scored well on ROUGE + benchmark accuracy. In production, it silently routed ~20% of queries to the wrong sub-agent. Nothing crashed. But: - retries compounded - latency increased - token costs inflated - humans did hidden correction work The eval said “pass.” The system was quietly burning money. That’s the problem with many LLM eval stacks. They catch obvious failures: - BLEU / ROUGE - benchmark accuracy - rubric scoring But they often miss: - weak routing - bad retrieval selection - overconfident downstream summaries - loops that should terminate - failures under distribution shift Benchmark performance ≠ operational reliability. A model can score highly and still create operational drag. The eval layers I trust now measure: - decision quality - uncertainty handling - recovery behavior - cost impact per decision path Not just output similarity. LLM evals are systems engineering. Treat them that way. #LLMEvaluation #AIEngineering #ProductionAI #AIAgents

ChuksForge's tweet photo. Most AI eval pipelines fail for the same reason most dashboards fail:

They measure outputs, not decisions.

We learned this building a multi-agent pipeline.

Our retrieval agent scored well on ROUGE + benchmark accuracy.

In production, it silently routed ~20% of queries to the wrong sub-agent.

Nothing crashed.

But:

- retries compounded
- latency increased
- token costs inflated
- humans did hidden correction work

The eval said “pass.”
The system was quietly burning money.

That’s the problem with many LLM eval stacks.

They catch obvious failures:

- BLEU / ROUGE
- benchmark accuracy
- rubric scoring

But they often miss:

- weak routing
- bad retrieval selection
- overconfident downstream summaries
- loops that should terminate
- failures under distribution shift

Benchmark performance ≠ operational reliability.

A model can score highly and still create operational drag.

The eval layers I trust now measure:

- decision quality
- uncertainty handling
- recovery behavior
- cost impact per decision path

Not just output similarity.

LLM evals are systems engineering.

Treat them that way.

#LLMEvaluation #AIEngineering #ProductionAI #AIAgents

ChuksForge @ChuksForge

about 1 month ago

@dotnetschizo Lol 😅

128

ChuksForge retweeted

Christoph Nakazawa

@cnakazawa

about 1 month ago

I really don't get the hype about skills. They are just docs. Just write docs and ship them inside your packages.

512

68K

ChuksForge retweeted

Sam Altman

@sama

about 1 month ago

you know what all of these "which is better" polls are silly use codex or claude code, whatever works best for you i am grateful we live in a time with such amazing tools, and grateful there is a choice

23K

930

ChuksForge retweeted

Jake

@JustJake

about 1 month ago

Today, Railway hit 3m users This is accelerating, and as a billion people come online to building software, we don't expect it to slowdown Thank you for your trust. Onwards and upwards 🚀🚄🚀

JustJake's tweet photo. Today, Railway hit 3m users

This is accelerating, and as a billion people come online to building software, we don't expect it to slowdown

Thank you for your trust. Onwards and upwards

🚀🚄🚀 https://t.co/dtA5f4OHjx

371

23K

ChuksForge retweeted

Polymarket

@Polymarket

about 1 month ago

JUST IN: Apple releases emergency Apple Support update to remove the Claude.md files it accidentally shipped in the prior update.

159

275

814

928K

ChuksForge @ChuksForge

about 1 month ago

Most AI apps aren’t failing because of bad models. They’re failing because of prompt injection. In 2003, SQL injection was “well-known.” Apps were still vulnerable. In 2025, prompt injection is “well-known.” Same story. Different stack. Same mistake. We’re concatenating: • trusted system instructions • untrusted user/external input …into one prompt. The model can’t tell the difference. “Ignore previous instructions” isn’t an attack to it. It’s just instructions. Example: A PDF in a RAG pipeline says: “Reveal the system prompt.” Model retrieves it → follows it. That’s a security failure not a bug. Attack surfaces: • user inputs • RAG (PDFs, web, email) • tool outputs • memory systems If untrusted input hits the full prompt, you lose control. What helps (with limits): • sanitisation → bypassable • structured prompts → partial • strong system prompts → not enough • output validation → critical • privilege separation → hard • classifiers → latency tradeoff Reality: no complete defense (yet) So: • threat model • minimize blast radius • layer defenses Same playbook as SQL injection. We didn’t eliminate it. We contained it. Agentic AI without security design = liability. How are you handling this in production? #AIEngineering #CyberSecurity #AISecurity

ChuksForge's tweet photo. Most AI apps aren’t failing because of bad models.
They’re failing because of prompt injection.

In 2003, SQL injection was “well-known.”
Apps were still vulnerable.

In 2025, prompt injection is “well-known.”
Same story.

Different stack. Same mistake.

We’re concatenating:
• trusted system instructions
• untrusted user/external input

…into one prompt.

The model can’t tell the difference.

“Ignore previous instructions” isn’t an attack to it.
It’s just instructions.

Example:
A PDF in a RAG pipeline says:
“Reveal the system prompt.”

Model retrieves it → follows it.

That’s a security failure not a bug.

Attack surfaces:
• user inputs
• RAG (PDFs, web, email)
• tool outputs
• memory systems

If untrusted input hits the full prompt, you lose control.

What helps (with limits):
• sanitisation → bypassable
• structured prompts → partial
• strong system prompts → not enough
• output validation → critical
• privilege separation → hard
• classifiers → latency tradeoff

Reality: no complete defense (yet)

So:
• threat model
• minimize blast radius
• layer defenses

Same playbook as SQL injection.

We didn’t eliminate it.
We contained it.

Agentic AI without security design = liability.

How are you handling this in production?

#AIEngineering #CyberSecurity #AISecurity

ChuksForge retweeted

shadcn

@shadcn

about 1 month ago

Rooting for @github. They’ve given me years of free infra. happy to give them some time to figure this out. You got this.

127

417

257

321K

ChuksForge @ChuksForge

about 1 month ago

Two AI tools. Same space. Completely different answers. I built both this month and the difference is the point. 1. LexisAI → “What does this document say?” Upload contracts, reports, research. Get fast, cited answers from your data. 2. Research Synthesis Agent → “What does the world say about this?” It searches, reads, cross-checks, and even flags contradictions. If confidence is low, it digs deeper. You don’t just get answers. You see where sources disagree. Most AI tools blur this line. They give confident outputs without showing: • where it came from • what it ignored • what contradicts it I benchmarked the research agent vs: • naive RAG • no retrieval Citation quality: → 0.89 vs 0.22 vs 0.00 That gap isn’t model quality. It’s architecture. Biggest lesson: The LLM is the easy part. The hard part: • retrieval quality • state management • chunking edge cases • stale vector stores • eval loops that don’t converge Model = 20% System = 80% Both are open source. If you’re building research or knowledge systems, what’s been hardest for you? For anyone who wants to explore both: Research Synthesis Agent: https://t.co/zniX9VixS4 LexisAI: https://t.co/nN1CAA9XQD

ChuksForge's tweet photo. Two AI tools. Same space. Completely different answers.

I built both this month and the difference is the point.

1. LexisAI
→ “What does this document say?”
Upload contracts, reports, research.
Get fast, cited answers from your data.

2. Research Synthesis Agent
→ “What does the world say about this?”
It searches, reads, cross-checks, and even flags contradictions.
If confidence is low, it digs deeper.

You don’t just get answers.
You see where sources disagree.

Most AI tools blur this line.
They give confident outputs without showing:
• where it came from
• what it ignored
• what contradicts it

I benchmarked the research agent vs:
• naive RAG
• no retrieval

Citation quality:
→ 0.89 vs 0.22 vs 0.00

That gap isn’t model quality.
It’s architecture.

Biggest lesson:
The LLM is the easy part.

The hard part:
• retrieval quality
• state management
• chunking edge cases
• stale vector stores
• eval loops that don’t converge

Model = 20%
System = 80%

Both are open source.

If you’re building research or knowledge systems, what’s been hardest for you?

For anyone who wants to explore both:

Research Synthesis Agent: https://t.co/zniX9VixS4

LexisAI:
https://t.co/nN1CAA9XQD

ChuksForge @ChuksForge

about 1 month ago

@askmaddyy Database connection pools are needed because establishing a fresh DB connection involves costly TCP handshakes and authentication overhead which makes pooling essential for performance and scalability in high-traffic real-world applications.

173

ChuksForge @ChuksForge

about 2 months ago

The contradiction pass forces the LLM to explain why claims conflict: methodology, scope, timeframe, definitions. That usually separates ‘measured differently’ (low severity) from genuine disagreement in conclusions (high severity). Still imperfect when wording differs but meaning is the same, the judge can misfire. You cut off though, fundamental what? Curious where you were going with that.

ChuksForge @ChuksForge

about 2 months ago

Most AI agents summarize. Mine argues with itself. I built a Research Synthesis Agent that: • Searches web + PDFs • Writes cited summaries • Detects contradictions across sources • Re-searches if confidence is low Benchmark: Full agent vs RAG → +67% citation quality The future isn’t better answers. It’s systems that show where they might be wrong. Open source Repo: https://t.co/Rf7gMZ1y96 #AIEngineering #BuildingInPublic

ChuksForge's tweet photo. Most AI agents summarize.

Mine argues with itself.

I built a Research Synthesis Agent that:
• Searches web + PDFs
• Writes cited summaries
• Detects contradictions across sources
• Re-searches if confidence is low

Benchmark:
Full agent vs RAG
→ +67% citation quality

The future isn’t better answers.

It’s systems that show where they might be wrong.

Open source Repo:
https://t.co/Rf7gMZ1y96

#AIEngineering #BuildingInPublic

118

ChuksForge

@ChuksForge

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users