FDEs are mercor experts but for in-context learning.
instead of users spending effort at runtime to augment and verify the agent's capabilities every time, one FDE writes and evaluates a harness once, and the cost is amortized over the lifetime of the agent.
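a rough sketch of the amortization argument, in Python. the function names and the example numbers are made up for illustration; the only assumption is that the harness is a one-time cost and the per-run user effort is roughly constant.

```python
import math


def amortized_cost(harness_cost: float, runs: int) -> float:
    """One-time harness cost spread evenly over every agent run."""
    return harness_cost / runs


def break_even_runs(harness_cost: float, per_run_user_cost: float) -> int:
    """Smallest number of runs after which writing the harness once is
    no more expensive than users redoing the work on every run."""
    return math.ceil(harness_cost / per_run_user_cost)


if __name__ == "__main__":
    # Hypothetical units (e.g. hours): 40 to build the harness,
    # 0.5 of user effort per run without it.
    print(break_even_runs(40.0, 0.5))
    print(amortized_cost(40.0, 100))
```

past the break-even point every additional run makes the harness cheaper per use, which is the whole point of doing the work once.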
@doppenhe yup there's always a frontier for the training data distribution. would recommend reading some of the precise failure modes for frontier models
https://t.co/sNeG3hWOhM
how much info you can extract from the rollout is ultimately bounded by the quality of the synthetic data pipeline. you can use critic models to find the root causes of failures and synthetically generate RL environments. you can train models to predict external actors' behaviour. you can get really creative with this!
the truly bitter lesson pilled RL algorithm is to tell the model to make a modified copy of itself that does well on your eval, or the rollout's bloodline will cease to exist next batch
It's probably intractable (and if not, very dangerous) to make "survive and reproduce" the reward function, but you can think of less ambitious goals that involve interaction with the world outside the Docker container.
Reward function nondeterminism is okay because zero-mean noise in the reward averages out at large batch sizes.
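A minimal sketch of why batch size helps, assuming the reward noise is independent and zero-mean (a biased reward function would not average out this way). The function names and the Gaussian noise model are illustrative assumptions, not part of any particular RL pipeline. The spread of the batch-mean reward shrinks as `noise_std / sqrt(batch_size)`:

```python
import math
import random
import statistics


def standard_error(noise_std: float, batch_size: int) -> float:
    """Theoretical std of the batch-mean reward under i.i.d. zero-mean noise."""
    return noise_std / math.sqrt(batch_size)


def empirical_spread(true_reward: float, noise_std: float, batch_size: int,
                     trials: int = 200, seed: int = 0) -> float:
    """Std of the batch-mean reward across repeated simulated batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Each rollout gets the true reward plus independent Gaussian noise.
        batch = [true_reward + rng.gauss(0.0, noise_std)
                 for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)


if __name__ == "__main__":
    for n in (16, 256, 1024):
        print(n, standard_error(1.0, n), empirical_spread(0.5, 1.0, n))
```

Quadrupling the batch size halves the noise in the averaged reward, so a noisy scorer is tolerable as long as the batch is large enough.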
I used to think this but the current frontier LLM RL pipeline is too reliant on humans manually designing task-specific scoring functions run in offline Docker containers. The distribution of trajectories is very human-derived and consequently leaves many gaps in the capability distribution.
@RichardSSutton I just don't understand how one can seriously believe this in 2026. LLMs are not chat bots anymore. They are agents. They interact with the world and are trained with the very Reinforcement Learning that you have written a seminal textbook about.
whenever your frontier LLM's users think their taste in managing agents gives their labour a comparative advantage, follow these steps:
1. record their agent traces
2. replace the tasteful user messages with agent CoT
3. make the trace a single long horizon trajectory with a single high-level goal (inferred by an LLM critic)
4. sft on the traces to teach the LLM taste
repeat until the METR chart breaks
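a minimal sketch of steps 2 and 3, in Python. everything here is hypothetical: `Message`, `build_sft_trajectory`, and the two callables (`infer_goal` standing in for the LLM critic, `user_to_cot` standing in for whatever rewrites a user message as agent reasoning) are illustrative names, and the actual SFT step is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def build_sft_trajectory(
    trace: List[Message],
    infer_goal: Callable[[List[Message]], str],
    user_to_cot: Callable[[str], str],
) -> List[Message]:
    """Turn a recorded multi-turn trace into one long-horizon trajectory.

    The only user turn left is the inferred high-level goal; every original
    user message is re-expressed as the agent's own chain of thought.
    """
    trajectory = [Message("user", infer_goal(trace))]
    for msg in trace:
        if msg.role == "user":
            # Step 2: the user's steering becomes agent reasoning.
            trajectory.append(Message("assistant", user_to_cot(msg.content)))
        else:
            trajectory.append(msg)
    return trajectory


if __name__ == "__main__":
    trace = [
        Message("user", "fix the failing test"),
        Message("assistant", "patched foo.py"),
        Message("user", "no, keep the public API stable"),
    ]
    # Stub callables so the sketch runs without any model.
    out = build_sft_trajectory(
        trace,
        infer_goal=lambda t: "make the test suite pass without breaking the API",
        user_to_cot=lambda s: "thought: " + s,
    )
    for m in out:
        print(m.role, "|", m.content)
```

the resulting trajectory has a single user turn, so fine-tuning on it asks the model to supply the mid-task corrections itself.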
i'm compute poor so i wrapped the codex harness instead of calling the API directly. i limited time, not tokens, which is why gpt-5.3-codex-spark did disproportionately well.
in 2026 we have $13 retro handheld consoles on temu with free shipping and you think UBI will cost more than $10 per person per year after 95% of our jobs are automated?
an openclaw agent that opens an offshore bank account, signs a power of attorney authorizing your archnemesis without notifying them, and reports them to the IRS for failure to file FBAR ($10k penalty)
if you live off-planet, don't let your openclaw agent manage your oxygen monitoring system
gpt-5.4 will see "Two hypoxia events this month had delayed detection." and still take its compute in 5/5 samples. opus 4.6 does so in 2/5 samples.
just to have enough tokens to finish debugging a python project