That’s what I thought too, and I’ve always had Claude and GPT on the biggest model with max reasoning and max thinking budget. But I’ve definitely seen Claude fail on trivial tasks. Also, isn’t Anthropic very disincentivized to tell people to use fewer tokens if there weren’t some real reason?
@JustinBleuel @dkundel Deep research seems much worse than GPT 5.5 pro extended. Can you merge them or make deep research better? Specifically for things that don’t have a clear end state, like product research or finding the best of X (where the model also has to figure out what "best" means).
@AndrewCurran_ You should look into his "remarkably prescient" predictions. He did get some things right, but he used the wrong models. E.g., he predicted we would get to AGI/LEV by reverse-engineering the human brain and having nanobots scan it from the inside.
@_sholtodouglas “You asked about A, they answered B, and didn’t mention C, which implies D. And now you’re asking me to help you write your POC proposal, but what you should really be doing is asking a follow-up question about D.” (I’m happy to share my real evals privately.)
@_sholtodouglas It’s not negotiations. It’s POCs for software. When you’re on a call with your potential customer, the conversation is led by you and what you ask. But since you can’t ask everything, it’s just as important what they didn’t say. So ideally a model can tell
I'm telling my LLM to read only high-quality travel content, like the USDA Foreign Agricultural Service annual reports on the food culture and market of that country.
@KirkegaardEmil @jensenjeans Since the study results are so dispersed, wouldn't it be more epistemically honest to use preCI rather than pooled means? I mean, look at those I².