New Galatiq Research benchmark: frontier LLMs as commodity futures portfolio managers.
Grok 4.20 leads out-of-sample at +13.6% on the weekly listwise track. That’s more than 2x the matched-cutoff performance of Claude Opus 4.6 (+6.18%) and GPT 5.4 (+6.82%).
Claude Opus 4.7 looked competitive on the full year (+18.3%), but 76% of the backtest fell inside its training cutoff. Once you measure each model only on data it couldn’t have seen in training, Grok 4.20 is the clear winner.
Training-data cutoffs change the leaderboard. DM us for the full study.
We build end-to-end Grok deployments for enterprises.
Starting this week, we’re sharing the use cases in production and under development. Loan underwriting, voice agents, ops, marketing, plus what’s coming next.
Follow along.