Thinking effort doesn't fix hallucination.
Even the best frontier model at matched HIGH still gets 24.2% of fields wrong on adversarial insurance docs. Going from default to HIGH buys 0-2pp per model.
https://t.co/Bw9enSaoYS
What makes this different: the generator emits the rendered document AND the ground-truth JSON in the same pass. No annotation step. Ground truth is authoritative by construction.
Full writeup, raw outputs, repo, 25-doc sample packet:
https://t.co/Bw9enSaoYS
GPT-5.5 reported $405.9M of revenue on a document that says $95M.
GPT-5.4 said $40.6M on the same page.
I built 148 adversarial insurance documents to test five frontier models. The numbers got weird.
Across all five models, 37% of extractions scored below 0.5 composite without ever tripping a catastrophic-error flag.
Production pipelines don't break loudly on these documents. They degrade silently, underneath whatever review threshold you trained your reviewers on.
I’m #hiring an individual contributor for a fully remote, global role at the intersection of vulnerability research, exploit development, and ML/AI — with a focus on fine-tuning open-weight #LLMs. 🧠
I’m not looking for an “LLM whisperer” or an “LLM pilot.” 🚫
I’m looking for someone who deeply understands post-training, data, evaluation, and how to make models reliable in real-world environments. 🔐
The application link is in the first comment. 🌍
#Hiring #LLM #AI #ML #FineTuning #CyberSecurity #llmwhisperer #llmpilot
@b1ack0wl Started a few companies (2 boot strapped 1 VC backed, new one bootstrapped but will very likely go raise) - happy to chat about it if it helps!
I think this is a mix of what @susantejuosho (https://t.co/YuwaMR2ocP) said, and also the changing demographic. YC used to target "older" founders who were used to the way things worked at big companies; the "you can just do things"/"go fast"/"do things that don't scale" was to help re-orient people from how things worked at big tech. But as they start having younger and younger people join, who do not have that context, the messaging is heavily muddled and distorted.
I think a lot of people are letting contexts grow too close to 300k+ tokens which is where capabilities start to drop off; but I think there's a good chance there is a "I built my early project on AI and it was FAST; it's now way more complex", and therefore giving them more issues that add further complexity
@spiritbuun Forcibly setting the effort level + forcing compaction well before 300k tokens and using subagents for many things has definitely kept things closer to how they used to be.
@MartinGTobias Latency I think is the bigger win for SLMs, and as companies have better data (or buy it) to train the models why rely on a third party when your own model is better/faster/cheaper.