Mapping 22,000 products to a 10,000-node taxonomy in days sounds like a context window problem.
It was. But that wasn't the hard part.
Two models ran in parallel — vector RAG + greedy tree traversal. Solo: ~60% accuracy each. Ensembled: they covered each other's blind spots.
Humans only touched the cases where models disagreed.
Result: 95%+ accuracy for Shopify Catalog's ground-truth data layer.
The hardest part wasn't the models. It was defining what "correct" even means when merchants intentionally break taxonomy rules.
Full breakdown: https://t.co/y9YT64fwg0
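A minimal sketch of the disagreement-routing idea described above, assuming each model can be wrapped as a function that maps a product title to a taxonomy node id. The names `Classifier`, `Decision`, `ensemble`, and `split_queue` are hypothetical and are not taken from the linked breakdown. Note that agreement between two models does not guarantee a correct label; it only decides which cases skip human review.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical classifier type: product title in, taxonomy node id out.
Classifier = Callable[[str], str]


@dataclass
class Decision:
    product: str
    label: Optional[str]          # agreed label, or None when sent to a human
    needs_review: bool
    candidates: Tuple[str, str]   # what each model proposed


def ensemble(product: str, model_a: Classifier, model_b: Classifier) -> Decision:
    """Accept the label when both models agree; otherwise flag for review."""
    a, b = model_a(product), model_b(product)
    if a == b:
        return Decision(product, a, False, (a, b))
    return Decision(product, None, True, (a, b))


def split_queue(
    products: List[str], model_a: Classifier, model_b: Classifier
) -> Tuple[List[Decision], List[Decision]]:
    """Split products into auto-accepted decisions and a human review queue."""
    decisions = [ensemble(p, model_a, model_b) for p in products]
    auto = [d for d in decisions if not d.needs_review]
    review = [d for d in decisions if d.needs_review]
    return auto, review
```

In practice `model_a` and `model_b` would wrap the vector RAG and tree-traversal models; here any two callables with that shape will do, and the review queue carries both candidate labels so a reviewer can see where the models split.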
We’re building @Tendem_AI: a hybrid AI/human agent targeting the complex, long-horizon work that is typically outsourced to freelancers. AI handles the heavy lifting (research, drafting, data), while vetted experts step in to ensure every result is business-ready.
Tendem sits on @Toloka’s infrastructure, using the same expert network, LLM QA tech, and quality control mechanisms that deliver high-quality data for frontier labs.
Available as a standalone product or via MCP, @Tendem_AI delivers 1.8x the quality of AI alone and is 53% faster than traditional freelancers (whitepaper: https://t.co/NPJWWvKqaU).
We are thrilled to announce our plans to integrate @Tendem_AI, our hybrid AI + Human engine, into the @NebiusAI ecosystem.
The goal? Human expert judgment as a native, programmable layer in your agentic workflows via MCP.
53% faster task completion. 21.3% quality improvement. Move from demo to production with confidence.
Learn more about our plans: https://t.co/rnHSARBUWQ
The horror stories are real. OpenClaw agents leaking credentials, overriding their own instructions, doing things they absolutely should not be doing.
Structured adversarial testing exists for a reason. Toloka runs it — 7 attack vectors, 300+ specialists, 6–8 hours.
Don't find out the hard way: https://t.co/fSYxg84DBs
Should we have human-in-the-loop or not?
We know even the best LLMs fail (I have many benchmarks to show). But can humans catch every issue? It's not possible. Things move too fast.
At @TolokaAI, we landed on a different conclusion:
Humans shouldn’t review every output at machine speed. They should be embedded where human judgment really matters: decomposition, quality definition, verification of hard or ambiguous cases, and pushing through truly complex work.
That’s what we built with @Tendem_AI. An LLM acts as project manager. Domain experts handle what needs real judgment and execution. Layered QA (AI + human) catches the rest.
Results (94 real-world tasks, blind evaluation):
- Hybrid: 74.5% client-ready quality (now it's even better, need to rerun the eval)
- Human-only (Upwork): 53.2%
- AI-only (ChatGPT Agent): 40.4%
https://t.co/xasFhIqq8X
https://t.co/HWHQbKF1TA
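One way to compare the three quality figures quoted above is as a ratio against AI-only and as a percentage-point gap against human-only. This is a small arithmetic sketch on the published numbers only; whether headline figures quoted elsewhere were computed this way is an assumption, not something the evaluation states.

```python
# Figures quoted in the results above (percent of tasks rated client-ready).
hybrid, human_only, ai_only = 74.5, 53.2, 40.4

ratio_vs_ai = hybrid / ai_only        # relative quality versus AI alone
gap_vs_human = hybrid - human_only    # percentage-point gap versus human-only

print(f"hybrid vs AI-only: {ratio_vs_ai:.2f}x")            # 1.84x
print(f"hybrid vs human-only: +{gap_vs_human:.1f} points")  # +21.3 points
```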
Tendem can be used as is on our site: https://t.co/ODfYbaFbB8
Tendem can be integrated using MCP: https://t.co/xmSPrDToHi
Tendem can be integrated into your system as a virtual employee (stay tuned)
What do you think about this approach?
Random people don’t create trust.
What you actually need are people who understand the context and stand behind the result.
That’s why we built https://t.co/i8O294gdQn
Because the future isn’t AI replacing humans - it’s collaboration with the right humans, who check and help instead of adding chaos.
@Tendem_AI
Been hammering Tendem lately and the AI + human combo is straight-up magic.
Machine crushes the grunt work, humans nail the nuance.
It's not just better results… it's a sweet symphony that pure AI or solo freelancers can't touch.
Don't sleep on @TolokaAI - super impressive product!
Toloka's new self-serve data annotation platform is live.
Spin up a custom data project in minutes—just describe your task and our AI Assistant builds the full setup, from expert selection to pricing.
✅ 90+ expert specializations
✅ LLM‑powered QA to ensure annotations match your requirements
✅ Transparent, predictable pricing
✅ Fully self‑serve, quality‑first data for ML, LLMs, and AI agents
From image annotation to search queries and beyond—this is a new era for Toloka.
Try it now 👇
https://t.co/E9ZyLS37Bh