hot take: most model routing today is astrology.
routing on degradation and speed, sure. but the actual "which model for which task" decision? public benchmarks measure stuff that has nothing to do with your workload.
the companies getting this right built their own evals. everyone else is guessing with extra steps.
Everyone wants to build evals, but few people want to actually read through the data. You can’t have good model intuition without reading the tasks and traces. There is no substitute.
MiniMax performed impressively on BankerToolBench, suggesting it is a seriously viable option for real-world investment banking workflows.
We're launching BTB v2 soon and are excited to see how various models perform as the tasks become richer and more complex.
so the bull case is that every enterprise on earth has unique judgment worth codifying, and someone has to do it. our whole pitch. $X00B+. one pushback: i think kirkland is smarter than the thread gives them credit for. going internal doesn't lock them out of harvey later. portable judgment + ride any model curve + still partner with vertical players when ready. sounds like leverage to me. every enterprise should be building applied evals.
some thoughts on kirkland building its own harvey
1) kirkland is spending $500m over four years in order to build its own internal ai legal tools; kirkland intends to spend $100m this year
2) i suspect that kirkland is doing this because they have told themselves that they have valuable data and because they want to appear differentiated
3) i think the first issue is that kirkland probably does not have data that is differentiated from that of other elite law firms; at least, not at the level a harvey would absorb
4) all the elite firms probably have similar internal workflow data and so long as some of them defect, that is enough to commoditize the data kirkland wants to use for its platform
5) and, to the extent that they do have different internal workflows, harvey and legora will end up representing a better version of those workflows, and this will put kirkland at a disadvantage
6) moreover, companies like kirkland will have difficulty building their internal legal platforms because they do not have experience with software development
7) and there are both cultural and structural issues with them managing software developers, e.g. they cannot give non-lawyers equity in the firm because of regulation
8) so, i think firms like kirkland are better off using tools like harvey and legora and then focusing on where their value really is now: client relationships, local knowledge (litigation, regulation) and legal r&d (novel structures, etc.)
9) anyway, this seems to me like a phenomenon that ai creates across a lot of industries, where previously vertically integrated firms get unbundled because part of the intelligence moves to the labs or otherwise gets commoditized
10) and so, a new set of companies is created whose job is to provide services complementary to the labs: forward deployed like harvey and legora and data providers like mercor, surge and handshake
This is effectively the #1 problem for AI agents in the enterprise.
As we go from agentic coding (where a large amount of context is in the code base, and users are technical enough to get the rest to the agent easily) to a world of knowledge work agents, the context problem becomes much more acute.
We see this every day with customers at Box. Existing digital knowledge is often fragmented across legacy systems or environments that don't play nice with agents, with access controls that don't map to the real work that needs to be done. That becomes a huge hurdle for getting agents the context they need. All of this has to get moved to modern, secure cloud environments.
But companies also often haven't captured and digitized some of the critical context that agents need to work with. Decisions, processes, and workflows often live in people's heads as tribal knowledge that needs to be turned into unstructured data for agents.
This is actually one of the biggest points of leverage for applied AI companies, because they can specialize in getting agents exactly the information and domain expertise they need. It's also one of the reasons FDEs and new systems integrator plays will work so well right now.
The companies that figure this out will be able to get the most out of AI going forward.
Imagine replacing 90% of your employees with a team of geniuses who have no idea how your company operates.
Total chaos. Nothing works.
That’s what AI feels like today.
The missing piece is extracting all the domain knowledge from people’s heads and providing that as structured context to the models.
data
former lawyers becoming legal engineers. former bankers becoming finance engineers. the expert isn’t getting automated. the expert is getting hired to teach the model.
the most advanced AI use cases are the map. code needs more data than ever. law is following. the rest of the economy is just earlier on the same curve.