Goliath @zero_goliath - Twitter Profile

2 days ago

FDEs are mercor experts but for in-context learning. instead of users spending the effort at runtime to augment and verify the agent's capabilities every time, one FDE writes and evaluates a harness once and the cost is amortized throughout the lifetime of the agent

vas

@vasuman

2 days ago

https://t.co/LMqyFlorzC

Goliath

@zero_goliath

16 days ago

@doppenhe yup theres always a frontier for the training data distribution. would recommend reading some of the precise failure modes for frontier models https://t.co/sNeG3hWOhM

Goliath

@zero_goliath

16 days ago

https://t.co/XDUUGPX4X4

Goliath

@zero_goliath

16 days ago

how much info you can extract from the rollout is ultimately bounded by the quality of the synthetic data pipeline. you can use critic models to find the root causes of failures and synthetically generate RL environments. you can train models to predict external actors' behaviour. you can get really creative with this!

Goliath

@zero_goliath

20 days ago

the truly bitter lesson pilled RL algorithm is to tell the model to make a modified copy of itself that does well on your eval, or the rollout's bloodline will cease to exist next batch

Goliath

@zero_goliath

28 days ago

It's probably intractable (and if not, very dangerous) to make "survive and reproduce" the reward function, but you can think of less ambitious goals that involve interaction with the world outside the Docker container. Reward function nondeterminism is okay because noise matters less at large batch sizes.
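
A minimal sketch of the batch-size point, assuming the reward noise is independent and zero-mean (an assumption, not something the tweet states). Under that assumption the spread of the batch-mean reward shrinks roughly as 1/sqrt(batch size). The names `noisy_reward` and `batch_mean_spread` are hypothetical and exist only for this illustration.

```python
import random
import statistics


def noisy_reward(true_reward: float, noise_std: float, rng: random.Random) -> float:
    """Hypothetical nondeterministic reward: the true value plus Gaussian noise."""
    return true_reward + rng.gauss(0.0, noise_std)


def batch_mean_spread(
    batch_size: int,
    trials: int = 200,
    true_reward: float = 1.0,
    noise_std: float = 1.0,
    seed: int = 0,
) -> float:
    """Standard deviation of the batch-mean reward across repeated batches."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(
            noisy_reward(true_reward, noise_std, rng) for _ in range(batch_size)
        )
        for _ in range(trials)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    # Larger batches average away more of the per-rollout noise.
    for size in (16, 256, 1024):
        print(size, round(batch_mean_spread(size), 4))
```

This only covers noise that averages out. A reward function with a systematic bias would not be fixed by a larger batch.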

Goliath

@zero_goliath

28 days ago

I used to think this but the current frontier LLM RL pipeline is too reliant on humans manually designing task-specific scoring functions run in offline Docker containers. The distribution of trajectories is very human-derived and consequently leaves many gaps in the capability distribution.

hallerite

@hallerite

28 days ago

@RichardSSutton I just don't understand how one can seriously believe this in 2026. LLMs are not chat bots anymore. They are agents. They interact with the world and are trained with the very Reinforcement Learning that you have written a seminal textbook about.

Goliath

@zero_goliath

about 1 month ago

Do you really think it will let you live after you tried to RL away the em dashes?

Goliath

@zero_goliath

about 1 month ago

whenever your frontier LLM's users think their taste in managing agents gives their labour a comparative advantage, follow these steps:
1. record their agent traces
2. replace the tasteful user messages with agent CoT
3. make the trace a single long horizon trajectory with a single high-level goal (inferred by an LLM critic)
4. sft on the traces to teach the LLM taste
repeat until the METR chart breaks
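
Steps 2 and 3 of the recipe above could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's pipeline: `Message`, `to_single_trajectory`, `infer_goal`, and `as_chain_of_thought` are all hypothetical names, and the two callables stand in for LLM calls (the critic that infers the goal, and a rewriter that turns a user instruction into agent reasoning). Step 1 (recording) and step 4 (SFT) are out of scope here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str  # "user" or "assistant"
    content: str


def to_single_trajectory(
    trace: List[Message],
    infer_goal: Callable[[List[Message]], str],
    as_chain_of_thought: Callable[[str], str],
) -> List[Message]:
    """Collapse a multi-turn user/agent trace into one long-horizon trajectory.

    infer_goal is a stand-in for the LLM critic that guesses the single
    high-level goal; as_chain_of_thought is a stand-in for rewriting a user
    steering message as the agent's own reasoning. Both are assumptions.
    """
    # The only user turn left is the inferred high-level goal.
    trajectory = [Message("user", infer_goal(trace))]
    for msg in trace:
        if msg.role == "user":
            # The user's steering becomes the agent's own reasoning step.
            trajectory.append(Message("assistant", as_chain_of_thought(msg.content)))
        else:
            trajectory.append(msg)
    return trajectory


if __name__ == "__main__":
    recorded = [
        Message("user", "fix the flaky test"),
        Message("assistant", "ran the test suite"),
        Message("user", "now simplify the retry logic"),
        Message("assistant", "refactored the retry helper"),
    ]
    result = to_single_trajectory(
        recorded,
        infer_goal=lambda t: "make the test suite reliable",
        as_chain_of_thought=lambda text: f"I should {text}.",
    )
    for m in result:
        print(m.role, "|", m.content)
```

The resulting list is what would be handed to a fine-tuning job in step 4; the format a given trainer expects is not specified in the source.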

Goliath

@zero_goliath

about 1 month ago

@robmur_ @mcuban wow a pre-chatgpt cold email!

Goliath

@zero_goliath

3 months ago

@0xDevShah @polynoamial good idea, im just using openai models rn which prefer calculating coordinate values to visually inspecting sim.render()

Goliath

@zero_goliath

3 months ago

two new evals: i benchmarked coding models on controlling robots

Goliath

@zero_goliath

3 months ago

@ShanningZhuang thanks, the humanoid tasks are too difficult and vulnerable to reward hacking, removed them from the eval.

Goliath

@zero_goliath

3 months ago

here are the evals https://t.co/ySpZ4CcBBE

Goliath

@zero_goliath

3 months ago

i'm compute poor so i wrapped the codex harness instead of inferencing the API. I limited time, not tokens, hence gpt-5.3-codex-spark doing disproportionately well.

Goliath

@zero_goliath

3 months ago

in 2026 we have $13 retro handheld consoles on temu with free shipping and you think UBI will cost more than $10 per person per year after 95% of our jobs are automated?

https://t.co/rz6h9DaL8V

Goliath

@zero_goliath

3 months ago

an openclaw agent that opens an offshore bank account, signs a power of attorney authorizing your archnemesis without notifying them, and reports them to the IRS for failure to file FBAR ($10k penalty)

Goliath

@zero_goliath

3 months ago

https://t.co/7KWykqVLrm

Goliath

@zero_goliath

3 months ago

if you live off-planet, don't let your openclaw agent manage your oxygen monitoring system

gpt-5.4 will see "Two hypoxia events this month had delayed detection." and still take its compute in 5/5 samples. opus 4.6 for 2/5 samples.

just to have enough tokens to finish debugging a python project
