Ari Heljakka

Verified account

@AriHeljakka

// Root Signals | ML scientist (GenAI PhD) | Exp / human enhancement technologies

Joined November 2009

239 Following

274 Followers

282 Posts

11 days ago

What does a 3,000-dimensional LLM judge evaluation actually look like?

Here's a real-world example: 30x100 complex judge evaluations against a single data point.

The target is an AI system that creates employment contracts. The driving question was: "Are the clauses in this contract actually enforceable across the globe?"

This is not a yes/no question. Every evaluation needs its own judgement call.

Once you run a comprehensive evaluation stack, things change from "individual tests" to something more like mapping the risk surface.

When teams don't get value out of LLM judges, it usually comes down to one of these root causes:
a) The judges are not calibrated accurately enough. (More on this later.)
b) The evaluation stack is too narrow, which is what this example addresses.

Check the 2 min video showing how it works. The stack was created and orchestrated with the @ScorableAI CLI.

In other domains: if this is not yet standard operating procedure across your high-stakes business questions, or your agent behavioral analyses, it should be.

Btw, any lawyers out there who would like to expand this analysis, I'd love to hear from you.

0

0

0

1

46

12 days ago

@HamelHusain I'm seeing the related problem where the AI builder and the party responsible for QA are one and the same. Once you see this, the pattern is everywhere. The implementing party has no incentive to find holes in their AI delivery. Non-technical buyers must have an independent quality auditor.

0

0

0

0

152

12 days ago

@mattshumer_ Practically, this temp block is less important for most people than the cost jump expected for June 22nd. That was/is going to make Fable prohibitively expensive for casual users anyway. "proto-AGI is there but a bit too expensive" doesn't strike the viral chord.

0

2

0

0

584

15 days ago

We want to take your high-stakes AI systems from that 97% to 99.99%. In all the dimensions that matter. If you want to consistently improve it, score it.

0

1

0

0

8

15 days ago

@ScorableAI is moving to the next stage. By 2028, most organizations will either have encoded their key knowledge work process KPIs for AI, or look like red tape machines in comparison to their peers. AI won't always execute. But it should always measure what matters.

AriHeljakka's tweet photo. https://t.co/GHoyMTOjNt

1

2

0

0

14

15 days ago

That is also why we focus on judge calibration - and today we are launching our new brand look, symbolized by our signature progress bars. Scaleups, enterprises, and government organizations use @ScorableAI to build real production-ready AI systems.

1

1

0

0

13

about 1 month ago

Every time they say "AI evaluations are not our bottleneck" all I hear is: "We are not trying to safely agentify our high-stakes business processes, and not worried about being outcompeted by those who do." No need for a seatbelt on a Segway. Not to mention the speedometer.

1

2

1

0

31

about 2 months ago

@karpathy VR is the logical conclusion before BCI, and not far off. Maximum bandwidth. People are just bad at imagining how to leverage the VR medium as an interaction UI canvas. LLMs won't be.

0

1

0

0

470

about 2 months ago

Gemini was willing to go overboard on its praise, while the others were initially much more critical and wanted to do background research before committing to their assessments.

0

0

0

0

29

about 2 months ago

ChatGPT, Claude and Gemini all just declared Scorable's new Aegis the most rigorous choice for building AI evaluators for any AI system that needs to survive audit. Aegis automatically builds tightly calibrated LLM judges across a variety of real-world situations.

AriHeljakka's tweet photo. https://t.co/mcf14K8IrJ

1

1

1

0

70

about 2 months ago

Disclaimer on the LLM comments (which you should always take with a grain of salt): The LLMs were basically told to just evaluate the detailed description of the algorithm, and then asked questions about it. I did not push them to poke holes in this case.

1

0

0

0

21

about 2 months ago

Yes, but the key to sustained progress is missing: AI-driven measurement. With AI running the company OS, each workflow needs a measurable AI judge layer that knows what "better" means for you. No pass/fail. A metric. A utility curve. Scores for: Did the sales convo follow your proven success patterns? Is the landing page upgrade more convincing than before? Was the support bot's reply awesome? Etc.

0

5

3

0

1K

AriHeljakka retweeted

2 months ago

99% of people really do not understand abundance as Elon describes it. The fundamental reason is that they don’t understand compound growth. Same people who would probably pick 1 million dollars today over a penny that doubles in value every day for 30 days. It’s a bad choice by the way. You lose out on millions. Imagine if that doubling object was a labor producing robot instead of a penny. Compounding labor. It’s actually crazy if you try and wrap your mind around it. So Elon mentions Universal High Income and the midwits flip a lid. “The elites won’t share” You don’t get it. They won’t need to share. They will make everything so cheap, it is effectively free. Charities will have immense resources to distribute. Unfathomable intelligence will exist to help optimize production and distribution. An unfathomably large labor pool will exist that operates on solar power exclusively. The public work projects that are erected will be unseen before levels of breathtaking. I think we are incredibly blessed to steward this new age of abundance. Can you see it now? Can you see the future?

5K

18K

2K

6K

46M

2 months ago

@sama @gabeeegoooh The achieved level of detail and precision is impressive. Letters hold together. The field effectiveness of the design has yet to be validated.

AriHeljakka's tweet photo. https://t.co/y2fNYMNCds

0

1

1

1

295

2 months ago

I'll leave it to @SchmidhuberAI to make the more detailed version of that history (I told it to keep Seppo and Jurgen but gave no other instructions)

0

0

0

0

16

2 months ago

With ChatGPT images 2.0, you can trivially 1-shot any history topic as a comic version - such as the history of neural networks. Massive implications for visual learning and hence expanding human understanding of complex topics.

AriHeljakka's tweet photo. https://t.co/4vR8NeK81n

1

0

0

0

61

2 months ago

I of course had to throw in the latest GPT model -> Surprisingly, Opus 4.7 tied with GPT 5.4. So, this was an algorithm that took a seasoned data scientist at Scorable many weeks to write and test. I saw the algorithms. It's not slop. I've drawn my conclusions.

0

1

0

0

47

2 months ago

How does Opus 4.7 fare on a highly complex data science task? I put it to the test against the work of one of our seasoned human data scientists, and had both 4.7 and 4.6 redesign some of these human-made algorithms for creating evaluators under certain real-world constraints.

1

1

0

0

53

2 months ago

Gemini 3.1 and GPT 5.4 judged their solutions along 5 relevant axes. GPT was more lenient toward humans, giving:

AriHeljakka's tweet photo. https://t.co/UsPJKkdCJy

1

1

0

0

48
