There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations.
Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability.
By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.
> Hello 100x engineer, you’ve spent $100k in tokens this month. What have you to show for it?
> I was building a harness for my AI tooling setup. Nothing that impacts the company bottom line.
> Sounds good to me. FYI we’re going to go lay off half the company because we’re over budget. Keep up the good work buddy.
PICARD: Data, shields up
DATA: Brilliant! Shields can reduce damage we sustain. Not immunity. Not hubris. Just prudence. It's not precaution—it's strategy.
[camera shakes]
WORF: HULL BREACHES ON NINE DECKS
DATA: Here's what happened: you told me to raise shields, and I didn't
"The LLM knew it was violating my rules and did it anyway!"
No. LLMs don't know anything. They can't think.
When you asked them 'why' or 'did you know you were breaking the rules', the response was hallucinated.
1/2
For the "small test" they've modified their docs to remove mention of Claude Code in Claude Pro: https://t.co/cG75PWlZyj
It's been a shock to see Anthropic's integrity collapse in the face of commercial pressure. Would love a renewed commitment to straightforward honesty.
The more you look the worse it is.
It's simultaneously impressive that it can get anywhere near coherent, and yet the lack of attention to detail is woeful - how _useful_ is this capability without reliability?
I think maybe Taylor Lorenz is too deep in her tech fandom to see the difference between a genuinely useful tool (the internet) and a bad product in search of an application (generative AI)
We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens.
LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%🧵
🎉 Second paper this month! Introducing BERT-as-a-Judge (x @gisship) ⚖️
Evaluating LLMs with rigid lexical methods often fails right answers due to bad formatting. While "LLM-as-a-Judge" solves this, it remains costly & slow. Our fix? A lightweight, encoder-driven approach.
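To make the "rigid lexical methods fail right answers" point concrete, here is a minimal sketch (not the paper's method). The function names `exact_match` and `token_match` are made up for illustration. It shows strict string equality rejecting a correct but decorated answer, and a looser token check that accepts it but also accepts a wrong one.

```python
import re


def exact_match(prediction: str, reference: str) -> bool:
    """Rigid lexical check: passes only if the strings are identical."""
    return prediction == reference


def normalize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def token_match(prediction: str, reference: str) -> bool:
    """Looser check: the reference tokens appear contiguously in the prediction."""
    ref, pred = normalize(reference), normalize(prediction)
    n = len(ref)
    if n == 0:
        return False
    return any(pred[i:i + n] == ref for i in range(len(pred) - n + 1))


# A correct answer wrapped in markdown fails the rigid check...
print(exact_match("The answer is **42**.", "42"))
# ...passes the looser one...
print(token_match("The answer is **42**.", "42"))
# ...but so does an answer that explicitly rejects the reference.
print(token_match("It is not 42, it is 43.", "42"))
```

The last case is the gap a learned judge is meant to close: surface matching cannot tell a mention of the answer from a commitment to it.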
The models, they just want to learn (their current task and literally nothing else).
Training a toy transformer on 3 digit addition, sorting, reversal and modular addition.
Complete lobotomy at every task transition.
Hi, we are releasing ColGrep 1.2.0
ColGrep now incorporates BM25 trigrams to further enhance our multi-vector models using hybrid search.
Now, ColGrep prints relative paths by default (fewer tokens per result)
Exact same features as GREP
Improved CUDA usage and installation
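For readers unfamiliar with the idea, here is a rough sketch of BM25 over character trigrams fused with a dense score. This is not ColGrep's implementation. The function names, the `k1`/`b` defaults, the min-max fusion, and the dense scores in the usage lines are all assumptions made for illustration.

```python
import math
from collections import Counter


def trigrams(text: str) -> list[str]:
    """Overlapping character trigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document, using trigrams as the terms."""
    doc_terms = [Counter(trigrams(d)) for d in docs]
    lengths = [sum(c.values()) for c in doc_terms]
    avg_len = (sum(lengths) / len(docs)) or 1.0
    n = len(docs)
    scores = []
    for terms, length in zip(doc_terms, lengths):
        score = 0.0
        for t in set(trigrams(query)):
            tf = terms.get(t, 0)
            if tf == 0:
                continue
            df = sum(1 for c in doc_terms if t in c)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scores.append(score)
    return scores


def min_max(values: list[float]) -> list[float]:
    """Rescale to [0, 1]; all zeros if every value is the same."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(lexical: list[float], dense: list[float], alpha: float = 0.5) -> list[float]:
    """Weighted sum of normalized lexical and dense scores (one simple fusion choice)."""
    return [alpha * l + (1 - alpha) * d for l, d in zip(min_max(lexical), min_max(dense))]


docs = ["def parse_config(path):", "class HttpClient:", "def parse_args(argv):"]
lexical = bm25_scores("parse_conf", docs)
dense = [0.7, 0.1, 0.6]  # hypothetical similarity scores from an embedding model
print(fuse(lexical, dense))
```

Trigrams let a partial identifier like `parse_conf` still hit `parse_config`, which whole-word BM25 would miss.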
basically: anthropic sneakily turned down how hard claude thinks before editing code, changed the default from "high" to "medium" effort, and hid the reasoning from session logs. all without telling users.
an amd director had 7k sessions of telemetry to prove the degradation was real and measurable (not just vibes). anthropic admitted to the changes. there's a workaround (use "/effort max"). the uncomfortable part is most users had no data to notice it happened at all.