woolr @woolr_ - Twitter Profile

woolr_ retweeted

about 10 hours ago

SITUATION UPDATE: Anthropic is reversing its Fable 5 policy of covertly degrading performance for competing AI researchers, per Wired.

36

910

28

85

92K

woolr_ retweeted

Sayash Kapoor @sayashk

1 day ago

There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations. Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability. By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.

43

1K

119

188

105K

woolr_ retweeted

Suhail

@Suhail

1 day ago

I would like to +1 that this is a very bad policy. Respond with a refusal and deal with the fallout, but invisible NERFing is super uncool.

18

416

30

13

23K

woolr @woolr_

6 days ago

Probably bad

Moritz Wallawitsch

@MoritzW42

6 days ago

holy shit - their api is leaking customer data

175

4K

221

1K

2M

0

9

woolr @woolr_

7 days ago

Yep

william

@willllliam

8 days ago

this is a pretty lukewarm take atp but opus 4.7 and 4.8 are shockingly bad compared to 4.6, I can't believe anthropic actually shipped these models

24

393

4

36

30K

0

15

woolr_ retweeted

staysaasy

@staysaasy

16 days ago

> Hello 100x engineer, you’ve spent $100k in tokens this month. What have you to show for it
> I was building a harness for my AI tooling setup. Nothing that impacts the company bottom line.
> Sounds good to me. FYI we’re going to go layoff half the company because we’re over budget. Keep up the good work buddy.

22

2K

81

171

136K

woolr_ retweeted

Kyle 🚄 @KyleTrainEmoji

16 days ago

PICARD: Data, shields up DATA: Brilliant! Shields can reduce damage we sustain. Not immunity. Not hubris. Just prudence. It's not precaution—it's strategy. [camera shakes] WORF: HULL BREACHES ON NINE DECKS DATA: Here's what happened: you told me to raise shields, and I didn't

304

51K

5K

3K

1M

woolr @woolr_

24 days ago

@mschoening @ellie_huxtable @NotionHQ Thank you, this is so annoying

0

4

0

348

woolr @woolr_

28 days ago

He wants it to be a Thriller

FÚTBOL HUB

@futbol_hubX

28 days ago

🚨|🎙️ Burnley Head Coach, Michael Jackson: 🗣️ “We will take the game with Arsenal SERIOUSLY” 👀

754

27K

1K

283

2M

0

7

woolr @woolr_

about 1 month ago

When*

Eric Simons

@EricSimons

about 1 month ago

If Google releases a coding model that outperforms opus/codex I don’t think people are pricing in what that would mean

106

1K

20

92

125K

0

20

woolr_ retweeted

John A De Goes

@jdegoes

about 2 months ago

"The LLM knew it was violating my rules and did it anyway!" No. LLMs don't know anything. They can't think. When you asked them 'why' or 'did you know you were breaking the rules', the response was hallucinated. 1/2

61

1K

84

156

109K

woolr_ retweeted

Matt Henderson @matthen2

about 2 months ago

The latest claude code burns $$ roughly 5x faster than version 2.1.71 with opus 4.6. I tested it today and tracked my usage. I'm downgrading for now!

32

422

9

107

68K

woolr_ retweeted

Jeremy Howard

@jeremyphoward

about 2 months ago

For the "small test" they've modified their docs to remove mention of Claude Code in Claude Pro: https://t.co/cG75PWlZyj It's been a shock to see Anthropic's integrity collapse in the face of commercial pressure. Would love a renewed commitment to straightforward honesty.

36

1K

67

87

80K

woolr @woolr_

about 2 months ago

The more you look the worse it is. Simultaneously impressive that it can get anywhere near coherent and yet, the lack of attention to detail is woeful - how _useful_ is this capability without reliability?

Emad

@EMostaque

about 2 months ago

gpt image 2 can do some really hard prompts well. I present: the periodic table of Pokémon

15

258

23

62

46K

0

16

woolr_ retweeted

Lee @futureghost327

about 2 months ago

I think maybe Taylor Lorenz is too deep in her tech fandom to see the difference between a genuinely useful tool (the internet) and a bad product in search of an application (generative AI)

11

357

12

21

132K

woolr_ retweeted

Sumeet Motwani

@sumeetrm

about 2 months ago

We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens. LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%🧵

19

404

70

272

141K

woolr_ retweeted

Nicolas Boizard @N1colAIs

about 2 months ago

🎉 Second paper this month! Introducing BERT-as-a-Judge (x @gisship) ⚖️ Evaluating LLMs with rigid lexical methods often fails right answers due to bad formatting. While "LLM-as-a-Judge" solves this, it remains costly & slow. Our fix? A lightweight, encoder-driven approach.

1

118

16

99

7K

woolr_ retweeted

Fleetwood

@fleetwood___

about 2 months ago

The models, they just want to learn (their current task and literally nothing else). Training a toy transformer on 3 digit addition, sorting, reversal and modular addition. Complete lobotomy at every task transition.

38

586

22

268

112K

woolr_ retweeted

Raphaël Sourty

@raphaelsrty

about 2 months ago

Hi, we are releasing ColGrep 1.2.0
ColGrep now incorporates BM25 trigrams to further enhance our multi-vector models using hybrid search.
ColGrep now prints relative paths by default (fewer tokens per result)
Exact same features as GREP
Improved CUDA usage and installation

4

111

14

56

19K

woolr_ retweeted

Teng Yan

@tengyanAI

2 months ago

basically: anthropic sneakily turned down how hard claude thinks before editing code, changed the default from "high" to "medium" effort, and hid the reasoning from session logs. all without telling users. an amd director had 7k sessions of telemetry to prove the degradation was real and measurable (not just vibes). anthropic admitted to the changes. there's a workaround (use "/effort max"). the uncomfortable part is most users had no data to notice it happened at all.

180

8K

620

4K

1M
