Marcus Castro

@mac_a_castro

CEO @SovranoAI. 400+ universities. 500k+ professionals. 20+ languages. Building the European human evaluation data layer: robotics, RL plus more.

Barcelona, Spain

Joined September 2020

992 Following

242 Followers

286 Posts

Marcus Castro

@mac_a_castro

about 12 hours ago

@polynoamial @OpenAI The interesting part is what stays hard to copy even after you publish the idea. The method travels fast, but the expert-labeled reasoning traces you trained the verifier on don't. That's where the real lead sits.

Marcus Castro

@mac_a_castro

about 12 hours ago

@_lewtun Dataset uploads are the sleeper feature here. The moment people can train on their own curated data, the bottleneck shifts from model access to who has the better expert judgment baked in. Excited to check it out.

Marcus Castro

@mac_a_castro

about 12 hours ago

@gneubig The harness+LLM coupling is the right frame. The next layer down is who wrote the eval tasks. A holistic benchmark is only as good as the expert judgment behind the gold trajectories, and that part rarely gets measured. How are you sourcing and validating those?

Marcus Castro

@mac_a_castro

about 12 hours ago

Europe just spent years arguing about chips and compute. The quieter move is the one that matters: Data Labs inside the AI Factories, curating European data so our models don't have to borrow someone else's. The moat was never the GPUs. It was always the data. And ours is still being built. https://t.co/B7g1VO86rw

Marcus Castro

@mac_a_castro

1 day ago

@dseetharaman @pewresearch That gap is the interesting part. People trust the tool in their hands more than the institutions building it. Worth asking who they think should regulate it, because the survey usually shows they don't trust themselves to either.

Marcus Castro

@mac_a_castro

1 day ago

Most robot demos are too clean. That's the problem. The usual way to teach a robot is to show it a "perfect" run and hope it copies you. But real people don't move perfectly. A non-expert shows the robot a wobbly version, and the whole thing gets unstable. Here's the part that got me. Hao Jiang and his team at @sjtu1896 didn't throw out the messy demos. They scored them. They built a system that watches a bunch of human attempts, figures out which parts are good and which are sloppy, and weights them. The rough demos still teach something. They just count less. Result: the robot moves closer to what the human actually meant, even when the humans were uneven. We keep saying AI will need fewer humans. This says the opposite. You need humans who know what good looks like, so the machine can tell good from noise. That skill isn't going away. It's becoming the input that matters.

Marcus Castro

@mac_a_castro

3 days ago

Nvidia CEO says “I’d hire the graduate who’s an expert in AI over the one who isn’t. Every time.” Important nuance: He's not talking about people who use ChatGPT, since everyone uses AI now. He's talking about people who actually understand how to work with the stack. Agents. APIs. Workflows. Automation tools. Frameworks. How to chain systems together and make them produce output consistently. https://t.co/H2y7AwJuJh

Rahul

@sairahul1

19 days ago

🚨 CEO of Nvidia: "I'd hire the graduate who's expert in AI over the one who isn't. Every time" and he's not talking about people who use AI everyone uses AI. he's talking about people who know the stack. agents. frameworks. tools. workflows. skills. automations Bookmark it.

Marcus Castro

@mac_a_castro

3 days ago

Give an AI judge two blank answers. It still picks a winner. That's the part that stopped me. A new paper from Hiroyasu Usami and team calls this "dark current." The judge emits a verdict even when there's nothing to judge. Pure noise, dressed up as a preference. They ran three open models through it. Llama-3.1-8B? High dark current. It had opinions about empty inputs. Qwen2.5-32B? Clean. Barely flinched on blank or cosmetic changes. Here's the line I keep thinking about. Changing the prompt didn't make the judge smarter. It just moved where it drew the line. The resolution stayed the same. Only the threshold moved. So if your eval pipeline runs on an AI judge, you might be measuring the ruler, not the thing. The takeaway I landed on: A judge is an instrument, not an oracle. Which means someone who actually knows the domain still has to check the ruler. Great work from Usami and the team.

Marcus Castro

@mac_a_castro

4 days ago

A human, blindfolded, unsheathed a sword. Not with their hands. Through a robot. @litian_liang and the team behind UME built an upper-arm exoskeleton that feeds real torque back to the operator while you teleoperate a robot. You feel what the robot feels. Here is the part that got me. Most robot demonstration data is just positions. Where the arm went. It throws away force. How hard you pushed, how you eased off when something resisted. But force IS the skill. Opening a tight drawer. Flipping a box. Working in a space too cramped to see. That's all touch, not sight. UME captures the whole-arm torque, so the robot learns the feel, not just the path. And it works across the OpenArm, the Franka, the X-ARM. Same operator, different bodies. The takeaway I keep coming back to: the richest teaching data still comes from a human who knows exactly how hard to push. Great work out of this group. Worth a read.

Marcus Castro

@mac_a_castro

6 days ago

Wow. This will have a ripple effect across the industry on how sovereignty is viewed moving forward. Time for the EU to step up.

Anthropic

@AnthropicAI

7 days ago

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: https://t.co/bwn0sximKZ

Marcus Castro

@mac_a_castro

7 days ago

AI has an ego. Not a feeling. A measurable bias. Mario Sanz-Guerrero and his team (with Manuel Mager and Katharina von der Wense) ran a clean experiment. They gave a model an answer. Sometimes they said the model wrote it. Sometimes they said a user wrote it. Same answer. Word for word. Here is the part that got me. The model was up to 26% more confident when it thought the answer was its own. They call it ownership bias. And it shows up across six open models, three benchmarks, three different ways of asking. The fix is almost funny. Just tell the model its own answer came from a user. Confidence drops back to honest. Calibration improves up to 26%. No retraining. So the part of training that made models chatty also made them blind to their own mistakes. Which is exactly the kind of error a model will never catch on its own. That job still belongs to a person who knows better. Check out more here: https://t.co/pfieMROfO1

Marcus Castro

@mac_a_castro

8 days ago

AI passed the medical exam... Then one sentence broke it. Hongjian Zhou and his colleagues ran a test most people never think to run. They took medical questions AI already answered correctly. Then they slipped in one misleading line. A fake rule. A made-up authority. Here is the part that got me. Accuracy fell from 71% to 38%. The model didn't get a harder question. It got the SAME question, plus a confident lie. And it folded. The worst attacks? Things that sounded official. "Per clinical guidelines..." Authority framing worked 69.5% of the time. Then a 14-person panel of real doctors from 7 countries read the answers. They flagged serious potential harm in 38% of cases. So here's the thing a test score can't tell you. Knowing the answer and holding the answer under pressure are two different skills. The doctor who doesn't flinch when a patient insists they're wrong? That judgment isn't on the exam. It's getting more valuable, not less.

Marcus Castro

@mac_a_castro

9 days ago

AI aces high school math. But ask it to grade a real student's messy reasoning? It fails. New benchmark from Yiteng Mao's team (ECNU): error rate doubles on human answers vs AI-written ones. Solving isn't judging. https://t.co/5BK7VQO5a5

Marcus Castro

@mac_a_castro

10 days ago

Solid analysis on Unitree. One of the better analyses of robotics (and potentially of why Unitree is in the top position) I have read in a long time. Enjoy.

SemiAnalysis

@SemiAnalysis_

11 days ago

We just published a deep dive on why Unitree is going to dominate global robotics. Timing could not be better. (2/2) https://t.co/0HKSayvvTK

Marcus Castro

@mac_a_castro

10 days ago

How to Learn Harness Engineering. Great resource👇 In simple terms: You don’t “prompt” agents into reliability. You engineer the system they operate in. Worth reading if you’re building with Codex, Claude Code, or any agentic workflow. https://t.co/7zOZFmn39W

Marcus Castro

@mac_a_castro

10 days ago

"Human evaluation" is the gold standard everyone hides behind. Katelyn Xiaoying Mei's team (UW) hand-checked 284 top papers. Most don't say who judged, what they were asked, or how to read the score. Is the gold standard a "vibe"? Rigor is the moat. https://t.co/Ikf6aKWPJ9

Marcus Castro

@mac_a_castro

11 days ago

New benchmark from Cognition. A focus on quality coding with a solid eval to measure it. Worth diving deeper into the post.

Cognition

@cognition

11 days ago

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

Marcus Castro

@mac_a_castro

11 days ago

@DavidSHolz 👋

Marcus Castro

@mac_a_castro

11 days ago

New open-source book that dives into the "black box" behind the mechanisms of large deep networks. Great to see leading material like this that is 1) open source and 2) taught at top higher education institutions. Great work @YiMaTweets

Yi Ma

@YiMaTweets

11 days ago

Our new open-source book on the Principles and Practice of Deep Representation Learning (A Mathematical Theory of Memory) is now posted on the arXiv: https://t.co/EGURnwZr6H I will offer a new graduate course this fall at the University of Hong Kong. Everything will be open sourced!

Marcus Castro

@mac_a_castro

11 days ago

AI literacy doesn't come down to knowing how to prompt. It's knowing where to use AI, when not to, and how to spot when it's wrong. That's the real skill gap opening up in education right now. https://t.co/FeC1w1HPMc
