What does a 3,000-dimensional LLM judge evaluation actually look like?
Here's a real-world example: 30x100 complex judge evaluations against a single data point.
The target is an AI system that creates employment contracts.
The driving question was:
"Are the clauses in this contract actually enforceable across the globe?"
This is not a yes/no question.
Every evaluation needs its own judgement call.
Once you run a comprehensive evaluation stack, things change from "individual tests" to more like mapping the risk surface.
When teams don't get value out of LLM judges, it usually comes down to one of these root causes:
a) The judges not calibrated accurately enough. (More on this later.)
b) The evaluation stack is too narrow - addressed here.
Check the 2 min video showing how it works.
The stack created and orchestrated with @ScorableAI CLI.
In other domains - if this is not yet standard operating procedure across your high-stakes business questions, or your agent behavioral analyses, it should be.
Btw, any lawyers out there who would like to expand this analysis, I'd love to hear from you.
@HamelHusain I'm seeing the related problem where the AI builder and the QA responsible are the same party. Once you see this, the pattern is all over. The implementing party has no incentive to find holes in their AI delivery. Non-technical buyers must have an independent quality auditor.
@mattshumer_ Practically, this temp block is less important for most people than the cost jump expected for June 22nd. That was/is going to make Fable prohibitively expensive for casual users anyway. "proto-AGI is there but a bit too expensive" doesn't strike the viral chord.
We want to take your high-stakes AI systems from that 97% to 99.99%. In all the dimensions that matter. If you want to consistently improve it, score it.
@ScorableAI is moving to the next stage.
By 2028, most organizations will either have encoded their key knowledge work process KPIs for AI, or look like red tape machines in comparison to their peers.
AI won't always execute.
But it should always measure what matters.
That is also why we focus on judge calibration - and today we are launching our new brand look, symbolized by our signature progress bars.
Scaleups, enterprises, and government organizations use @ScorableAI to build real production-ready AI systems.
Every time they say "AI evaluations are not our bottleneck" all I hear is: "We are not trying to safely agentify our high-stakes business processes, and not worried about being outcompeted by those who do." No need for a seatbelt on a Segway. Not to mention the speedometer.
@karpathy VR is the logical conclusion before BCI, and not far off. Maximum bandwidth. People are just bad at imagining how to leverage VR medium as an interaction UI canvas. LLMs won't be.
Gemini was willing to go overboard on its praise, while the others were initially much more critical and wanted to do background research before committing to their assessments.
ChatGPT, Claude and Gemini all just declared Scorable's new Aegis the most rigorous choice for building AI evaluators for any AI system that needs to survive audit. Aegis automatically builds tightly calibrated LLM judges across a variety of real-world situations.
Disclaimer on the LLM comments (which you should always take with a grain of salt): The LLMs were basically told to just evaluate the detailed description of the algorithm, and then asked questions about it. I did not push them to poke holes in this case.
Yes, but the key to sustained progress is missing: AI-driven measurement.
With AI running the company OS, each workflow needs a measurable AI judge layer that knows what "better" means for you.
No pass/fail. A metric. A utility curve.
Scores for: Did the sales convo was follow your proven success patterns? Is the landing page upgrade more convincing than before? Was the support bot's reply awesome? Etc.
99% of people really do not understand abundance as Elon describes it.
The fundamental reason is that they don’t understand compound growth.
Same people who would probably pick 1 million dollars today over a penny that doubles in value every day for 30 days.
It’s a bad choice by the way. You lose out on millions.
Imagine if that doubling object was a labor producing robot instead of a penny.
Compounding labor. It’s actually crazy if you try and wrap your mind around it.
So Elon mentions Universl High Income and the midwits flip a lid.
“The elites won’t share”
You don’t get it. They won’t need to share. They will make everything so cheap, it is effectively free.
Charities will have immense resources to distribute.
Unfathomable intelligence will exist to help optimize production and distribution.
An unfathomably large labor pool will exist that operates on solar power exclusively.
The public work projects that are erected will be unseen before levels of breathtaking.
I think we are incredibly blessed to steward this new age of abundance.
Can you see it now?
Can you see the future?
@sama@gabeeegoooh The achieved level of detail and precision is impressive. Letters hold together. The field effectiveness of the design needs yet to be validated.
With ChatGPT images 2.0, you can trivially 1-shot any history topic as a comic version - such as the history of neural networks.
Massive implications for visual learning and hence expanding human understanding of complex topics.
I of course had to throw in the latest GPT model -> Surprisingly, Opus 4.7 had a tie with GPT 5.4. So, this was an algorithm that took many weeks to write and test for a seasoned data scientist at Scorable. I saw the algorithms. It's not slop. I've drawn my conclusions.
How does Opus 4.7 fare on a highly complex data science task? I put it to test against the work of one of our seasoned human data scientists, and had both 4.7 and 4.6 redesign some of these human-made algorithms for creating evaluators under certain real-world constraints.