@polynoamial @OpenAI The interesting part is what stays hard to copy even after you publish the idea. The method travels fast, but the expert-labeled reasoning traces you trained the verifier on don't. That's where the real lead sits.
@_lewtun Dataset uploads are the sleeper feature here. The moment people can train on their own curated data, the bottleneck shifts from model access to who has the better expert judgment baked in. Excited to check it out.
@gneubig The harness+LLM coupling is the right frame. The next layer down is who wrote the eval tasks. A holistic benchmark is only as good as the expert judgment behind the gold trajectories, and that part rarely gets measured. How are you sourcing and validating those?
Europe just spent years arguing about chips and compute.
The quieter move is the one that matters: Data Labs inside the AI Factories, curating European data so our models don't have to borrow someone else's.
The moat was never the GPUs.
It was always the data. And ours is still being built.
https://t.co/B7g1VO86rw
@dseetharaman @pewresearch That gap is the interesting part. People trust the tool in their hands more than the institutions building it. Worth asking who they think should regulate it, because the survey usually shows they don't trust themselves to either.
Most robot demos are too clean.
That's the problem.
The usual way to teach a robot is to show it a "perfect" run and hope it copies you.
But real people don't move perfectly. A non-expert shows the robot a wobbly version, and the whole thing gets unstable.
Here's the part that got me.
Hao Jiang and his team at @sjtu1896 didn't throw out the messy demos.
They scored them.
They built a system that watches a bunch of human attempts, figures out which parts are good and which are sloppy, and weights them.
The rough demos still teach something. They just count less.
Result: the robot moves closer to what the human actually meant, even when the humans were uneven.
We keep saying AI will need fewer humans.
This says the opposite.
You need humans who know what good looks like, so the machine can tell good from noise.
That skill isn't going away. It's becoming the input that matters.
Nvidia CEO says “I’d hire the graduate who’s an expert in AI over the one who isn’t. Every time.”
Important nuance: He's not talking about people who use ChatGPT, since everyone uses AI now. He's talking about people who actually understand how to work with the stack.
Agents. APIs. Workflows. Automation tools. Frameworks. How to chain systems together and make them produce output consistently.
https://t.co/H2y7AwJuJh
🚨 CEO of Nvidia: "I'd hire the graduate who's expert in AI over the one who isn't. Every time"
and he's not talking about people who use AI
everyone uses AI.
he's talking about people who know the stack.
agents. frameworks. tools. workflows. skills. automations
Bookmark it.
Give an AI judge two blank answers.
It still picks a winner.
That's the part that stopped me.
A new paper from Hiroyasu Usami and team calls this "dark current." The judge emits a verdict even when there's nothing to judge. Pure noise, dressed up as a preference.
They ran three open models through it.
Llama-3.1-8B? High dark current. It had opinions about empty inputs.
Qwen2.5-32B? Clean. Barely flinched on blank or cosmetic changes.
Here's the line I keep thinking about.
Changing the prompt didn't make the judge smarter. It just moved where it drew the line. The resolution stayed the same. Only the threshold moved.
So if your eval pipeline runs on an AI judge, you might be measuring the ruler, not the thing.
The takeaway I landed on:
A judge is an instrument, not an oracle.
Which means someone who actually knows the domain still has to check the ruler.
Great work from Usami and the team.
A human, blindfolded, unsheathed a sword.
Not with their hands. Through a robot.
@litian_liang and the team behind UME built an upper-arm exoskeleton that feeds real torque back to the operator while you teleoperate a robot. You feel what the robot feels.
Here is the part that got me.
Most robot demonstration data is just positions. Where the arm went. It throws away force. How hard you pushed, how you eased off when something resisted.
But force IS the skill.
Opening a tight drawer. Flipping a box. Working in a space too cramped to see. That's all touch, not sight.
UME captures the whole-arm torque, so the robot learns the feel, not just the path.
And it works across the OpenArm, the Franka, the X-ARM. Same operator, different bodies.
The takeaway I keep coming back to:
the richest teaching data still comes from a human who knows exactly how hard to push.
Great work out of this group. Worth a read.
The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees.
The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance.
Access to all other Claude models is not affected.
We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible.
Read our full statement: https://t.co/bwn0sximKZ
AI has an ego.
Not a feeling. A measurable bias.
Mario Sanz-Guerrero and his team (with Manuel Mager and Katharina von der Wense) ran a clean experiment.
They gave a model an answer.
Sometimes they said the model wrote it.
Sometimes they said a user wrote it.
Same answer. Word for word.
Here is the part that got me.
The model was up to 26% more confident when it thought the answer was its own.
They call it ownership bias.
And it shows up across six open models, three benchmarks, three different ways of asking.
The fix is almost funny. Just tell the model its own answer came from a user. Confidence drops back to honest. Calibration improves up to 26%. No retraining.
So the part of training that made models chatty also made them blind to their own mistakes.
Which is exactly the kind of error a model will never catch on its own.
That job still belongs to a person who knows better.
Check out more here:
https://t.co/pfieMROfO1
AI passed the medical exam...
Then one sentence broke it.
Hongjian Zhou and his colleagues ran a test most people never think to run.
They took medical questions AI already answered correctly.
Then they slipped in one misleading line. A fake rule. A made-up authority.
Here is the part that got me.
Accuracy fell from 71% to 38%.
The model didn't get a harder question. It got the SAME question, plus a confident lie. And it folded.
The worst attacks? Things that sounded official. "Per clinical guidelines..." Authority framing worked 69.5% of the time.
Then a 14-person panel of real doctors from 7 countries read the answers.
They flagged serious potential harm in 38% of cases.
So here's the thing a test score can't tell you.
Knowing the answer and holding the answer under pressure are two different skills.
The doctor who doesn't flinch when a patient insists they're wrong? That judgment isn't on the exam.
It's getting more valuable, not less.
AI aces high school math.
But ask it to grade a real student's messy reasoning? It fails.
New benchmark from Yiteng Mao's team (ECNU): error rate doubles on human answers vs AI-written ones.
Solving isn't judging.
https://t.co/5BK7VQO5a5
How to Learn Harness Engineering. Great resource👇
In simple terms: You don’t “prompt” agents into reliability. You engineer the system they operate in.
Worth reading if you’re building with Codex, Claude Code, or any agentic workflow.
https://t.co/7zOZFmn39W
"Human evaluation" is the gold standard everyone hides behind.
Katelyn Xiaoying Mei's team (UW) hand-checked 284 top papers.
Most don't say who judged, what they were asked, or how to read the score.
Is the gold standard a "vibe"?
Rigor is the moat https://t.co/Ikf6aKWPJ9
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
New open source book that dives into the "black box" behind the mechanisms of large deep networks.
Great to see leading material like this that is 1) open source 2) taught at top higher education institutions.
Great work @YiMaTweets
Our new open-source book on the Principles and Practice of Deep Representation Learning (A Mathematical Theory of Memory) is now posted on the arXiv: https://t.co/EGURnwZr6H I will offer a new graduate course this fall at the University of Hong Kong. Everything will be open sourced!
AI literacy doesn't come down to knowing how to prompt. It's knowing where to use AI, when not to, and how to spot when it's wrong.
That's the real skill gap opening up in education right now. https://t.co/FeC1w1HPMc