Founder/CEO @ScaledCognition, former CVP Conversational AI @Microsoft, former Founder/CEO Semantic Machines, Shaser BioScience, and Voice Signal Technologies
LLMs lie. We build models that tell the truth.
Today, we're excited to announce our $100M Series A led by @vkhosla and @KhoslaVentures.
@profdanklein and I founded @ScaledCognition to solve the key challenge in AI: reliability.
THE MODEL CONSTELLATION GAMBIT - Because generalist models are non-deterministic, AI application layer companies cannot trust the output. To compensate, they build Constellations, complex model chains where a router classifies the input, a frontier model creates a draft, a supervisor model grades it, a reasoning model critiques it and reprompts to fix errors (as one example).
Some present this Rube Goldberg machine as a flex, proof of their sophisticated technology and a reason to persuade enterprise customers they can’t possibly do this themselves. In reality, it is an admission that their engine is unreliable.
Here is why the Constellation approach is an architectural trap:
The Physics of Stacked Error Rates - When you chain probabilistic models, errors do not cancel; they compound. In a workflow with 5 steps where each model is 95% reliable (being generous), the math is unforgiving: 0.95^5 ≈ 77%.
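To make the arithmetic concrete, here is a minimal sketch. It assumes each step fails independently of the others, which is a simplification, and the function name is purely illustrative.

```python
import math

def chain_reliability(step_reliabilities):
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_reliabilities)

# Five chained steps, each 95% reliable (the figures used in the post).
five_step_chain = chain_reliability([0.95] * 5)
print(f"{five_step_chain:.1%}")  # 77.4%
```

Under the same assumption, every additional step in the chain multiplies in another factor below 1, so end-to-end reliability only goes down as the chain grows.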
The Latency Spiral - In a live CX system, latency is the enemy. The ear easily detects pauses of 500ms. The router, foundation, supervisor and reprompting models all add latency. The network hops between the private-lab-hosted endpoint and wherever they’re hosting the other models add time. Suddenly, the customer is waiting multiple seconds or longer for a reply.
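A rough way to see how this adds up is a simple latency budget. The per-stage numbers below are hypothetical placeholders, not measurements; only the 500ms pause threshold comes from the post. The sketch also assumes the stages run strictly one after another.

```python
PAUSE_THRESHOLD_MS = 500  # pause length the ear easily detects (from the post)

def total_latency_ms(stage_latencies_ms, network_hop_ms, hops):
    """Sum sequential stage latencies plus a fixed cost per network hop."""
    return sum(stage_latencies_ms.values()) + network_hop_ms * hops

# Hypothetical per-stage latencies in milliseconds, for illustration only.
stages = {"router": 150, "draft": 800, "supervisor": 400, "reprompt": 800}

total = total_latency_ms(stages, network_hop_ms=50, hops=3)
print(f"{total} ms, over threshold: {total > PAUSE_THRESHOLD_MS}")
```

Swap in your own measured numbers. The point is only that sequential stages and hops add, so the total passes the pause threshold quickly.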
Economic Implications (Tokens & Compute) - Reprompting is the most expensive and least certain way to try to address reliability. When a supervisor model detects a problem, the system must discard the first answer and pay for another model (which burns 3x+ the tokens) to try again. These systems are paying for the mistake and the correction (if it can even be corrected). Over millions of transactions, the cost delta between one-shot-correct and generate-check-regenerate-route is the difference between software margins and no margins.
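Here is a back-of-the-envelope sketch of that cost delta. The 3x retry multiplier comes from the post; the check cost and failure rate are hypothetical, costs are in arbitrary units relative to one draft, and the model assumes at most one retry per request.

```python
def expected_cost_per_request(draft_cost, check_cost, failure_rate,
                              retry_multiplier=3.0):
    """Expected cost of generate -> check -> (maybe) regenerate.

    Every request pays for the draft and the supervisor check; a failed
    check additionally pays retry_multiplier * draft_cost for the retry.
    """
    return draft_cost + check_cost + failure_rate * retry_multiplier * draft_cost

one_shot = 1.0  # a model that is right the first time pays only for the draft
constellation = expected_cost_per_request(
    draft_cost=1.0, check_cost=0.5, failure_rate=0.05  # hypothetical values
)
print(f"{constellation / one_shot:.2f}x the one-shot cost")
```

Even with a low failure rate, the supervisor check is paid on every request, so the overhead never goes to zero.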
Infrastructure Fragility - The Constellation relies on a fragile web of disparate providers. The base model might be an OpenAI endpoint. The supervisor model might be running in a separate tenant on Azure or AWS. If any single API in this chain degrades, the entire workflow fails. The system has introduced multiple points of failure.
The Prompt Maintenance Nightmare - Finally, there is the human cost. In a Constellation, you are not just prompting one model; you are maintaining a delicate equilibrium between many models. When one provider updates its model weights, that model can start confusing the others. The engineering team is trapped in an endless cycle of updating prompts for huge numbers of models to keep the Constellation aligned. It is a fragile equilibrium that breaks at scale.
The Constellation is a gambit, not a moat; an attempt to cast technological weakness as a reason customers should fear in-sourcing. And a fragile attempt to force a probabilistic poet to act like a deterministic banker.
I read somewhere that parenting is really just prompt engineering. As parents of two teenagers, we’re constantly trying to figure out which token sequence will actually work to elicit the desired behavior, and which sequences will stick for more than ten minutes to get the model (our kids) to consistently adopt the prescribed agentic pattern. Like many LLM application devs, we find it’s often necessary to resort to ALL CAPS!!! And to repeat the instructions at the top and bottom of the kid-prompt.
Ah yes, parenting is fun. But it also made me think about the fact that companies today using nondeterministic, scatterbrained, generalist LLMs with prompts as the only means of control are literally hiring the equivalent of (in our case at least) ADD teenagers to handle important functions like CX. It’s a bit wild. I mean, I can only imagine how things would go if my kids were doing CX: “wait, why did you cancel that guy’s flight?? It says right here in the policy you’re not supposed to do that in this situation” “IDK dad, I didn’t read that part, stop crashing out, it’s not that deep” 😂
But it’s actually a real issue: for consequential workflows we need reliable systems that do the right thing every time, not just occasionally. We’ve focused our research on building agentic LLMs with novel technology that enforces policies every time, with the goal of creating systems that are actually reliable. APT-1 is able to do this, and is unlocking real value through reliable predictability as a result. I think this clip from Ilya makes the point perfectly.
New blog post - Prompt Trees: Training-time Prefix Caching. By the research team at @scaledcognition.
TL;DR: Training speedups of up to 70x on tree-structured data. Not 70%. _70x_.
https://t.co/EYD96dHAHk
(preprint version coming soon)
Excited to see old friends and make new ones at @NeurIPSConf this week.
We’re actively hiring across multiple roles! Feel free to DM me or stop by our table in the expo hall if you’re interested to know more about the cool stuff we’re building at @ScaledCognition. Many folks from our research team will be there and would love to meet you. You will also get a chance to play with a demo of our technology and get some cool swag! :)
Since founding Scaled Cognition, a neolab focused on building specialized, ultra reliable models for CX, I’ve heard a lot of what I’d call “LLM Maximalist” views from folks. Their basic premise is that the big private labs have reached escape velocity, their generalist models will do every conceivable unit of work with exceptional performance and there’s no need for specialization (or competition 🙂). I’ve never believed this; there are very few supporting examples historically. In my view the far more likely outcome is that generalist models will have enormous utility in many fields, but specialist models adapted to focus on particular kinds of applications (coding, CX, healthcare, biology…) will have meaningful adoption, providing better performance and unit economics. Additionally, the big labs are literally existential threats to their own key customers. We have already seen in coding with Claude Code and Codex that the labs are trying to crush their own partners (Cursor etc.): they want to own all the key spaces, and they need to in order to justify their valuations. It’s wild to watch these app layer companies feeding their key data to their big lab partners, giving them the info they need to crush them. It’s madness. And it’s not surprising that many are now trying to build their own models to escape this trap and have independence and viable margins. Of course building models is hard, and few have the skill sets or culture needed to incubate a successful research team. Satya explains that he sees the path forward as specialization as well and is skeptical that any one model will win. Will be interesting to see how things unfold… @ScaledCognition
Larry Ellison makes the point that models from major labs are all trained on the same internet text data, but to unlock real value they need to be trained on non-public enterprise data. But why? Surely it’s not that this private enterprise data has the missing information needed to achieve contextual representations of language. No, it’s because this data embodies the business logic and workflow signatures that represent the specific work the enterprise is looking to automate. Essentially he’s saying models need to specialize: they need training on the specific workflows they will be deployed against, and that data does not exist on the internet. As this field evolves, specialist models that are designed for specific tasks will consistently outperform generalist models, with better scale and unit economics. That said, training on proprietary data has a multitude of challenges: it’s messy, and often not easily accessible for model training. Synthetic data gen for training is the answer. This is how you train a cognitive core that understands the workflows but does not attempt to memorize the underlying data; instead it learns appropriate tool use and uses data connectors to pull the required data from licensed repositories. This approach is working extremely well for us in CX.
We’re actively hiring researchers! If you’re interested in building highly reliable specialized models for agentic use cases, come join us @ScaledCognition!
Our work ranges from low-level modeling advances to synthetic data generation and evaluation, and is directly impacting our end product and customers.
Feel free to DM me if you’re interested in learning more.
Most people don’t yet realize that systems based on general purpose LLMs are like building on jello. Models trained from the tangled mess of internet data and RL optimized for plausible sounding output are not well suited for workflow automation where precision and actual correctness matter. Going forward, for enterprise at least, we’re going to see the world move towards highly specialized models, trained on carefully calibrated (mostly synthetic) data that carry out specific tasks with exceptional performance, don’t hallucinate, and run at a fraction of the size and cost of their general purpose frontier cousins. This clip from Andrej lays it out well. It just so happens this is also exactly what we do.
Excited to reveal what we’ve been working on: a new specialized agentic model, the Agentic Pretrained Transformer, #1 on the Agentic benchmarks. We’ve assembled an incredible team of researchers at SC and we’re just getting started! You can read more below or on our website.
We’re Scaled Cognition, developing the first ever models trained specifically for agentic applications:
1. Our first system, APT-1, is now #1 on agentic benchmarks.
2. It was developed by a US team for a total cost of less than $11M.
3. Khosla Ventures led our seed round ($21M closed in 2023), and Vinod Khosla joined our board.
4. We use a fully synthetic, RL-based agentic data pipeline, no human-labeled data.
5. APT-1 is now available for early access.