Claude Code ultracode (xhigh) on FAST MODE seems to be working very slowly and creating bad code (but maybe it's on me... garbage in -> garbage out)
I've built a catalog of agentic engineering patterns, as well as some reflexive engineering primitives for a somewhat naive recursive AI implementation I've been working on. They're nothing fancy, just practical patterns that came out of real bugs we hit. If you're curious, check them out:
→ Agentic Engineering Patterns
https://t.co/1sRjdORvqN
→ Reflexive Engineering Primitives
https://t.co/ubCLqDr5Mg
@thsottiaux Idea: Internal agentic Chaos Monkey / bounty that inserts entropy to mess with Codex’s reliability. If it succeeds, you RCA it, fix the root cause, and reset our limits ad infinitum. Win-win. Deal?
It would be interesting for Codex to self-pace when approaching weekly limits, and to ask users whether they want to continue the current work or reprioritize.
@DimitrisPapail Agents do love strawmanning and drifting towards easier tasks. Feel like you periodically need to pull them back. I use “Don’t agent-strawman me” quite often.
In a world where humans make decisions driven by cognitive biases (e.g., loss aversion), will offloading far more choices to LLMs, which lack such utility functions, make outcomes more iatrogenic?
Some insights on using LLMs to forecast:
1. Do not ask “which LLM is best at forecasting?” Ask “best on which kind of question?” In my experiments, model rankings change by corpus/source and even by metric. A model can look best by Elo and worse by Brier.
2. Naive model ensembles are not automatically better. Averaging forecasts can lose to the best single model. The useful signal seems to be conditional routing by question/source/model family, not “just take the mean.”
3. A model’s confidence explanation is not the same thing as calibration. Chain-of-thought style content can sound epistemically rich while being weakly related to actual Brier error.
4. Auxiliary channels can predict forecast error. Asking for “worry,” “confidence,” or related side-channel estimates can reveal whether the model’s probability is fragile. But the sign and usefulness are model-family dependent.
5. Never pool worry signals across model families without checking sign. One family’s “worry” can mean useful tail-risk awareness; another’s can mean generic uncertainty theater.
6. Forecasting skill decays with source/cutoff currency. Post-cutoff, source-fresh questions appear much harder. This suggests LLM forecasting skill is partly inherited from training-distribution currency, not pure reasoning.
7. The practical object is not a better prompt. It is a forecast validity layer. The system should decide when to trust, route, shrink, ensemble, abstain, or demand fresh evidence.
8. Decision utility is stricter than Brier. A forecast can be better-calibrated in aggregate but still hurt downstream decisions if the threshold policy is wrong.
9. The strongest near-term product implication: use LLMs as conditional forecasting instruments, not standalone forecasters. They can generate probabilities, side-channel diagnostics, decomposition, and routing features. The deployment layer has to arbitrate.
10. A possible “law”: LLM forecasting errors are structured by carrier, source currency, and family-specific elicitation channels. If true, forecasting improvement comes from discovering the right conditional invariants, not from making one universal “superforecaster prompt.”
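A minimal sketch of points 2 and 5 in Python. Every number, model name, and helper name here is made up for illustration; none of it comes from the experiments above. It scores two hypothetical models with the Brier score, compares them to a naive mean ensemble, and checks the sign of the worry/error correlation per family before any pooling.

```python
from statistics import mean

def brier(probs, outcomes):
    """Brier score: mean squared error between probabilities and 0/1 outcomes."""
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))

def pearson(xs, ys):
    """Plain Pearson correlation (assumes neither series is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data: four resolved questions, two model families.
outcomes = [1, 0, 1, 0]
forecasts = {
    "family_a": [0.9, 0.1, 0.8, 0.2],
    "family_b": [0.6, 0.4, 0.9, 0.3],
}
# Hypothetical side-channel "worry" ratings elicited alongside each forecast.
worry = {
    "family_a": [0.1, 0.2, 0.7, 0.8],
    "family_b": [0.2, 0.3, 0.8, 0.6],
}

# Point 2: naive mean ensemble vs. each single model.
ensemble = [mean(ps) for ps in zip(*forecasts.values())]
scores = {name: brier(ps, outcomes) for name, ps in forecasts.items()}
scores["mean_ensemble"] = brier(ensemble, outcomes)

# Point 5: check the sign of worry vs. per-question squared error, per family.
worry_sign = {}
for name, ps in forecasts.items():
    errors = [(p - o) ** 2 for p, o in zip(ps, outcomes)]
    worry_sign[name] = pearson(worry[name], errors)

for name, score in scores.items():
    print(f"{name}: Brier={score:.4f}")
for name, r in worry_sign.items():
    print(f"{name}: worry/error correlation={r:+.2f}")
```

In this toy setup the mean ensemble scores worse than the best single model, and worry tracks error with opposite signs in the two families, so pooling the raw worry values would cancel the signal. Real data may behave differently; the point is only that both checks are cheap to run before trusting an ensemble or a side channel.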
Doing a bunch of follow-up experiments on when human forecasting biases transfer to LLMs, when they disappear, and when they mutate into different failure modes.
This was inspired by reading @PTetlock's work!