Tim Michaud @TimGMichaud - Twitter Profile

16 days ago

Thinking effort doesn't fix hallucination. Even the best frontier model at matched HIGH still gets 24.2% of fields wrong on adversarial insurance docs. Going from default to HIGH buys 0-2pp per model. https://t.co/Bw9enSaoYS

0

86
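
The "fields wrong" figure in the post above is a field-level error rate. The post does not say how fields were compared, so this is only a minimal sketch assuming exact-match comparison against a ground-truth record; the function names and sample fields are hypothetical.

```python
_MISSING = object()  # sentinel so a missing field never equals a real value


def field_error_rate(extracted: dict, truth: dict) -> float:
    """Fraction of ground-truth fields that are wrong or missing (exact match)."""
    if not truth:
        raise ValueError("ground truth has no fields")
    wrong = sum(
        1 for key, expected in truth.items()
        if extracted.get(key, _MISSING) != expected
    )
    return wrong / len(truth)


def gain_pp(default_rate: float, high_rate: float) -> float:
    """Improvement in percentage points when moving from default to HIGH effort."""
    return (default_rate - high_rate) * 100


# Hypothetical records, for illustration only.
truth = {"insured": "Acme LLC", "revenue_usd": 95_000_000, "state": "OH", "limit": 1_000_000}
default_run = {"insured": "Acme LLC", "revenue_usd": 40_600_000, "state": "OH"}
high_run = {"insured": "Acme LLC", "revenue_usd": 40_600_000, "state": "OH", "limit": 1_000_000}

print(field_error_rate(default_run, truth), field_error_rate(high_run, truth))
```

A real scorer would likely need tolerance rules for numbers and names; exact match is just the simplest assumption.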

Tim Michaud

@TimGMichaud

about 1 month ago

What makes this different: the generator emits the rendered document AND the ground-truth JSON in the same pass. No annotation step. Ground truth is authoritative by construction. Full writeup, raw outputs, repo, 25-doc sample packet: https://t.co/Bw9enSaoYS

0

29
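
The idea of emitting the rendered document and the ground-truth JSON in the same pass can be sketched as follows. This is not the author's generator; the field names, value ranges, and layout are assumptions chosen for illustration. The point is that both outputs derive from one record, so no annotation step is needed.

```python
import json
import random


def generate_case(seed: int) -> tuple[str, str]:
    """Return (rendered_document, ground_truth_json) built from one record."""
    rng = random.Random(seed)  # seeded so each case is reproducible

    # Hypothetical fields; a real generator would cover many more.
    truth = {
        "policy_number": f"POL-{rng.randint(100000, 999999)}",
        "insured_name": rng.choice(["Acme Freight LLC", "Northwind Logistics Inc"]),
        "annual_revenue_usd": rng.randint(1, 500) * 1_000_000,
    }

    # The document is rendered from the same record that becomes ground truth.
    document = (
        f"Policy No.: {truth['policy_number']}\n"
        f"Named Insured: {truth['insured_name']}\n"
        f"Annual Revenue: ${truth['annual_revenue_usd']:,}\n"
    )
    return document, json.dumps(truth, sort_keys=True)


doc, ground_truth = generate_case(seed=7)
print(doc)
print(ground_truth)
```

Adversarial variants (conflicting figures, distractor tables, and so on) would be layered onto the rendering step while the record stays the source of truth.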

Tim Michaud

@TimGMichaud

about 1 month ago

GPT-5.5 reported $405.9M of revenue on a document that says $95M. GPT-5.4 said $40.6M on the same page. I built 148 adversarial insurance documents to test five frontier models. The numbers got weird.

1

0

129

Tim Michaud

@TimGMichaud

about 1 month ago

Across all five models, 37% of extractions scored below 0.5 composite without ever tripping a catastrophic-error flag. Production pipelines don't break loudly on these documents. They degrade silently, underneath whatever review threshold you trained your reviewers on.

1

0

38
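
The silent-degradation case described above (composite below 0.5, no catastrophic-error flag) can be isolated with a simple filter. This is a sketch under assumptions: the `ExtractionResult` type, its field names, and the sample scores are hypothetical, and only the 0.5 threshold comes from the post.

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    doc_id: str
    composite: float    # composite score in [0, 1]
    catastrophic: bool  # True if a catastrophic-error flag was tripped


def silent_failures(results, threshold=0.5):
    """Low-scoring extractions that raised no catastrophic-error flag."""
    return [r for r in results if r.composite < threshold and not r.catastrophic]


# Hypothetical scores, for illustration only.
results = [
    ExtractionResult("doc-001", 0.91, False),
    ExtractionResult("doc-002", 0.42, False),  # silent: low score, no flag
    ExtractionResult("doc-003", 0.10, True),   # loud: flag tripped
    ExtractionResult("doc-004", 0.31, False),  # silent: low score, no flag
]

silent = silent_failures(results)
print([r.doc_id for r in silent], len(silent) / len(results))
```

Tracking this share separately from the flagged failures is one way to surface the degradation that a flag-based review threshold would miss.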

TimGMichaud retweeted

@daviddiaul

about 1 month ago

I’m #hiring an individual contributor for a fully remote, global role at the intersection of vulnerability research, exploit development, and ML/AI — with a focus on fine-tuning open-weight #LLMs. 🧠 I’m not looking for an “LLM whisperer” or an “LLM pilot.” 🚫 I’m looking for someone who deeply understands post-training, data, evaluation, and how to make models reliable in real-world environments. 🔐 The application link is in the first comment. 🌍 #Hiring #LLM #AI #ML #FineTuning #CyberSecurity #llmwhisperer #llmpilot

2

70

20

28

26K

Tim Michaud

@TimGMichaud

about 2 months ago

@GergelyOrosz Yeah I had this turn off on me before; SUPER annoying cause it's not obvious that it's off (or on!) :|

0

1

0

55

Tim Michaud

@TimGMichaud

about 2 months ago

@b1ack0wl Started a few companies (2 bootstrapped, 1 VC-backed; the new one is bootstrapped but will very likely raise) - happy to chat about it if it helps!

0

1

0

67

Tim Michaud

@TimGMichaud

about 2 months ago

I think this is a mix of what @susantejuosho (https://t.co/YuwaMR2ocP) said, and also the changing demographic. YC used to target "older" founders who were used to the way things worked at big companies; the "you can just do things"/"go fast"/"do things that don't scale" was to help re-orient people from how things worked at big tech. But as they start having younger and younger people join, who do not have that context, the messaging is heavily muddled and distorted.

0

2

0

111

Tim Michaud

@TimGMichaud

about 2 months ago

@GergelyOrosz Happened to us last year; was such a PITA we ended up cancelling.

0

33

Tim Michaud

@TimGMichaud

about 2 months ago

@HackingDave Honestly if 5.5 is an improved 5.3xhigh I think we might see a switch back towards OAI.

0

962

Tim Michaud

@TimGMichaud

about 2 months ago

@thefineprintesq Oh that's interesting; thanks for the info :)!

0

1

0

25

Tim Michaud

@TimGMichaud

about 2 months ago

I think a lot of people are letting contexts grow close to 300k+ tokens, which is where capabilities start to drop off. But I think there's also a good chance of an "I built my early project on AI and it was FAST; it's now way more complex" effect, where the added complexity gives them more issues, which in turn add further complexity.

0

1

0

178

Tim Michaud

@TimGMichaud

about 2 months ago

@spiritbuun Forcibly setting the effort level + forcing compaction well before 300k tokens and using subagents for many things has definitely kept things closer to how they used to be.

0

2

0

3

2K

Tim Michaud

@TimGMichaud

about 2 months ago

Codex (5.3xhigh) is a lot closer to CC than when I first used it; hope the gap continues to close.

1

0

152

Tim Michaud

@TimGMichaud

about 2 months ago

@MartinGTobias Latency I think is the bigger win for SLMs, and as companies have better data (or buy it) to train the models why rely on a third party when your own model is better/faster/cheaper.

0

1

0

95

Tim Michaud

@TimGMichaud

2 months ago

@HackingDave Not my experience on the Claude side, though neither of them have ever had anything more than ~mid level engineer FMPOV.

1

0

308
