Almost all AI model and agent progress is downstream from evals. Open weights post training for specific domains comes down to evals. Agent improvements in the applied AI layer is all about evals. Agentic enterprise deployments that actually can augment work is all about evals. It’s all evals.
This will become a core competency of any enterprise in the future. The companies that are able to best understand their own (and/or customers) workflows and how well agents participate in that work will be in the best position to actually drive real automation.
Introducing Stack.
The AI operating system that lets accounting firms take on more clients without hiring. Learns your firm's process, runs the close, posts the journals. Fully auditable.
We’re living through the biggest shift in accounting since the spreadsheet.
A common dynamic I observe with AI: it feels most impressive when you don’t know much about the subject, don’t care or don’t have a clear idea of what the you want.
This applies across design, code, legal, and more. If I don’t know code very well, every piece of code it writes feels very impressive.
Once you know what something should feel or look like, it becomes almost impossible to guide AI there. And you definitely can’t one-shot it.
Having some AI follow you into your zoom meetings or google meet for taking notes is the digital equivalent of showing up to a meeting with your fly down
My hope for AI agents is that building them feels like solving a Rubik's Cube in hindsight.
It looks hopelessly scrambled right up until the moment it's solved.