We (Emochi, with 10M+ users) stopped using benchmark scores for production decisions about 6 months ago.
The pattern was consistent: models ranking top on benchmarks underperformed in real usage. Models ranked lower drove better retention.
A few notes on what we learned:
Open-sourced Claude Code Web.
For people who can’t run Claude Code Client locally, or prefer a browser workspace over the TUI.
Runs with local Claude Code CLI or on a remote server via SSH tunnel.
Most importantly: you can change it into any skin you like.
Open-sourced Claude Code Web.
For people who can’t run Claude Code Client locally, or prefer a browser workspace over the TUI.
Runs with local Claude Code CLI or on a remote server via SSH tunnel.
Most importantly: you can change it into any skin you like.
Open-sourced Claude Code Web.
For people who can’t run Claude Code Client locally, or prefer a browser workspace over the TUI.
Runs with local Claude Code CLI or on a remote server via SSH tunnel.
Most importantly: you can change it into any skin you like.
Small tool that I built to help me give better interview:
input: JD + resume - output: questions for each round
input: transcript + JD - output: hiring suggestion with detailed reasons (so far 80%+ aligned with my decision)
We (Emochi, with 10M+ users) stopped using benchmark scores for production decisions about 6 months ago.
The pattern was consistent: models ranking top on benchmarks underperformed in real usage. Models ranked lower drove better retention.
A few notes on what we learned:
One takeaway for 2026: Eval infrastructure will determine product ceiling more than model selection.
The question isn't "which model is best." It's "which system learns fastest from real users."
Wrote up the full framework: https://t.co/RQXkDotJUM
We rebuilt the pipeline on three components:
1. Elo/TrueSkill on real conversations — filters out bad models in hours
2. Conversation-level A/B — not user-level (key distinction below)
3. Reward models trained on behavioral signals — closes the loop into training
Core issue: benchmarks optimize for "correct responses" on discrete tasks.
Consumer AI optimizes for "experiences that feel right" across continuous interactions. These objectives aren't aligned. At scale, they actively conflict.
@venturetwins My lesson learned from building Emochi is that the key is not only the model itself but the closed loop eval-feedback-iteration infrastructure. Happy to share more thoughts
Interviews are broken. resumes mislead.
We helped 100k+ people land jobs & scaled to $10M ARR. now we’re rebuilding hiring from scratch.
meet WorkTrial AI — where companies see the real work before they hire.
If your product relies on user-facing prompt features, this post is for you. I’m sharing hard-won lessons from 10,000+ prompt iterations across complex, structure-sensitive workflows where every percentage point of success rate mattered.
https://t.co/EiiDdLhHLE
Sharing my experience to help people write better production-ready prompts - check out "The First Production-Ready Prompts Guide"
https://t.co/EiiDdLhHLE via @LinkedIn
Sharing my experience to help people write better production-ready prompts - check out "The First Production-Ready Prompts Guide"
https://t.co/EiiDdLhHLE via @LinkedIn