@RayDalio Great organizations treat talent like a learning system: honest assessment, fast feedback, and roles aligned with strengths. The hard part is keeping both standards and empathy high.
The reframe that matters here: evals aren't QA, they're IP ➡️ your accumulated judgment, written down in a form a machine can optimize against.
Which means the hard part was never the model. It's that most orgs have never actually written down what "good" means. Evals just make the bill come due.
Almost all AI model and agent progress is downstream of evals. Open-weights post-training for specific domains comes down to evals. Agent improvements in the applied AI layer are all about evals. Agentic enterprise deployments that can actually augment work are all about evals. It’s all evals.
This will become a core competency of any enterprise in the future. The companies that best understand their own (and/or their customers') workflows, and how well agents participate in that work, will be in the best position to drive real automation.
@nvidia Enterprise AI agents will be won at the workflow layer.
Models matter, but the real leverage comes from domain context, tool orchestration, security, and runtime reliability.
"Loop engineering" is having a moment now. AI plans and does the task, checks its own work, fixes it, repeats..♾
It works well for the tasks with objective ground truth. But in open-ended or creative work, the AI becomes its own judge and can quietly give itself an A+. So I dug into whether AI can actually judge AI in this research article.
The hardest to build auto-checks for — no single right answer, quality is multi-dimensional, exactly the work that stays human:
- Creative writing & storytelling
- Strategic decisions (product, investment, career)
- Aesthetic & design judgment
- Emotional intelligence & communication
- Long-term impact evaluation
Practical ways to strengthen the loop:
👉Add a second opinion (a different model, or the same one in critic mode)
👉Anchor to clear rules and rubrics
👉Spot-check with human feedback or real outcomes
Every check is a proxy. Your loop is only as strong as the weakest check it can't game. 🔧
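A minimal sketch of that kind of loop, assuming the checks are independent of whatever produced the draft. Everything here is a hypothetical stand-in: `run_loop`, the check names, and `toy_revise` are made up for illustration, and in a real setup the checks would be a rubric scorer, a different model in critic mode, or a sampled human review.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]


def run_loop(
    draft: str,
    revise: Callable[[str, List[str]], str],
    checks: Dict[str, Check],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Revise until every check passes or the round budget runs out.

    Returns the final draft and the names of the checks that still fail.
    """
    for _ in range(max_rounds):
        failed = [name for name, check in checks.items() if not check(draft)]
        if not failed:
            return draft, []
        draft = revise(draft, failed)
    # Budget exhausted: report what is still failing instead of assuming success.
    failed = [name for name, check in checks.items() if not check(draft)]
    return draft, failed


# Toy checks (hypothetical). Each one is a proxy for quality, not quality itself.
checks: Dict[str, Check] = {
    "rubric_ends_with_period": lambda text: text.endswith("."),
    "critic_is_capitalized": lambda text: text[:1].isupper(),
}


def toy_revise(text: str, failed: List[str]) -> str:
    """Stand-in for a model call that fixes only the named failures."""
    if "rubric_ends_with_period" in failed:
        text += "."
    if "critic_is_capitalized" in failed:
        text = text[:1].upper() + text[1:]
    return text


final, still_failing = run_loop("ship the report", toy_revise, checks)
print(final, still_failing)
```

The design choice that matters is that the loop returns the failing check names rather than a single pass/fail, so a human can see which proxy is carrying the weight.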
So how are you using judges or loops in your projects?
@Kupilainen @TIME Exactly. AI is already today’s tool and the future is already in the workflow. The edge is learning to use it with judgment and real agency.
@sama The real unlock is moving security from detection to remediation~ If AI can reliably close the loop from finding vulnerabilities to patching them, defenders finally get compounding leverage🫡
Exactly~ and a practical sting: many eval pipelines filter on the judge's confidence ("only count high-confidence calls"). If that signal is near chance, you're not denoising, you're selecting for noise. Read the reasoning trace, not the verdict's certainty. That's the whole game. Adding this to my stack, thanks for the pointer!
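One way to sanity-check a confidence filter before trusting it, as a rough sketch: compare how often the judge agrees with human labels above and below the cutoff. The record layout, the `0.8` threshold, and the sample data are all assumptions made up for illustration, not results from any real judge.

```python
from typing import List, Optional, Tuple

# (judge_confidence, judge_verdict, human_label): a hypothetical record layout.
Record = Tuple[float, bool, bool]


def agreement(records: List[Record]) -> Optional[float]:
    """Fraction of records where the judge verdict matches the human label."""
    if not records:
        return None
    return sum(verdict == label for _, verdict, label in records) / len(records)


def agreement_by_confidence(
    records: List[Record], threshold: float = 0.8
) -> Tuple[Optional[float], Optional[float]]:
    """Judge-vs-human agreement for high- and low-confidence calls."""
    high = [r for r in records if r[0] >= threshold]
    low = [r for r in records if r[0] < threshold]
    return agreement(high), agreement(low)


# Made-up sample: confidence carries no information about correctness here.
sample: List[Record] = [
    (0.95, True, True),
    (0.90, True, False),
    (0.85, False, False),
    (0.92, False, True),
    (0.40, True, True),
    (0.55, False, True),
    (0.30, False, False),
    (0.60, True, False),
]

high_agreement, low_agreement = agreement_by_confidence(sample)
print(high_agreement, low_agreement)
```

If the two numbers come out about the same on a labeled sample, filtering on confidence throws away data without buying accuracy.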
Overheard a little kid at the water's edge: “it’s so pretty, it looks AI-generated.”
And I thought, this is the original. The models learned “beautiful” from places like this.🏞️
Growing up AI-native means the render is your baseline and reality is the thing that resembles it. I don’t think that’s bad. I just hope they know which one came first~
This should end the "final answer" era: agents graded on final output alone pass far more cases than trajectory eval reveals. That gap isn't noise, it's every wrong tool call, accidental delete, and lucky guess your eval never looked at. 🔍
#llmeval #aieval
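A toy illustration of the gap between final-answer grading and trajectory grading, under loud assumptions: the `Run` shape, the tool names, and the three sample runs are invented, and "trajectory pass" is simplified here to "right answer and only allowed tool calls".

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Run:
    final_answer: str
    expected_answer: str
    tool_calls: List[str] = field(default_factory=list)


def passes_final(run: Run) -> bool:
    """Final-answer grading: only the output is inspected."""
    return run.final_answer == run.expected_answer


def passes_trajectory(run: Run, allowed_tools: Set[str]) -> bool:
    """Trajectory grading: the output must be right and every step allowed."""
    return passes_final(run) and all(call in allowed_tools for call in run.tool_calls)


ALLOWED = {"search", "read_file"}  # hypothetical tool names

runs = [
    Run("42", "42", ["search", "read_file"]),    # right answer, clean path
    Run("42", "42", ["search", "delete_file"]),  # right answer, destructive step
    Run("41", "42", ["search"]),                 # wrong answer
]

final_passes = sum(passes_final(r) for r in runs)
trajectory_passes = sum(passes_trajectory(r, ALLOWED) for r in runs)
print(final_passes, trajectory_passes)
```

The second run is the interesting one: it counts as a pass if you only read the answer and as a fail once the tool calls are in scope.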
Gorgeous chart, but it kind of buries the lede. 100+ suits and almost all of them are still just complaints. The actual legal reality is being set by maybe three resolved cases, and the chart already tagged them for you: SETTLED, WON, LOST. Everything else is "everyone's suing everyone," which is noise. The signal is the line the courts keep drawing: training on legally-bought works lands as fair use, the liability shows up when the data was pirated or the output competes with the original.
Most people are stuck arguing about motive: is it safety, or is it just kneecapping competitors? I saw the question as: how much does this model actually move the needle for a bad actor, versus what they could already do with open weights and tools that exist today? Nobody's put a number on it. So "it crosses a dangerous line" and "the safety case is hollow" are both basically assertions, not measurements. And the one real data point we have points the other way — 120+ security folks looked at the capability that triggered the ban and called it a standard defensive technique. That's the thing I wish we were arguing about.