Full deep dive (3,800 words):
• Architecture explained
• V1→V2→V3 evolution
• What I’d do differently
• Why defensive > aggressive for compounding
https://t.co/T7KzHCQyQI
If you’re building systematic trading systems and value honesty over hype, subscribe.
Next post: what V3 actually learned.
Early V3 signals (combos 2-5 of 84):
CAGR: +4% → +17% → +17% → +22% ⬆️
Sharpe: -1.84 → +0.12 → +0.31 → +0.42 ⬆️
Bear fraction in buffer: 24% → 33% → 31% ✓
The model is learning. Sharpe has improved with every combo so far.
Full training: ~13 days remaining.
I built a defensive RL trading system. During the Lehman 2008 replay, it lost 22%. The S&P lost 55%. The story of 18 months iterating on V1 → V2 → V3 below 🧵
V2 training revealed an unexpected problem:
Excessive turnover. ~20 trades/day. $40K in fees per simulated episode.
The Ulcer penalty alone didn’t reduce churn—the agent was rebalancing constantly to maintain its risk profile.
Lesson: reward shaping has interaction effects.
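To make the interaction concrete, here is a minimal sketch of a shaped reward where a drawdown term and a turnover cost sit side by side. This is not the V2 or V3 reward: the function name, the squared-drawdown stand-in for the Ulcer penalty, and both coefficients are assumptions for illustration only.

```python
def shaped_reward(step_return, drawdown, old_weights, new_weights,
                  drawdown_penalty=0.5, cost_per_unit_turnover=0.001):
    """Hypothetical per-step reward: return minus a drawdown penalty minus a trading cost.

    drawdown is the current fractional drop from the running equity peak.
    Turnover is the sum of absolute weight changes across assets.
    Both coefficients are illustrative, not tuned values.
    """
    turnover = sum(abs(n - o) for o, n in zip(old_weights, new_weights))
    return (step_return
            - drawdown_penalty * drawdown ** 2
            - cost_per_unit_turnover * turnover)

# Same return and drawdown, but one step rebalances and the other holds.
hold = shaped_reward(0.001, 0.05, [0.6, 0.4], [0.6, 0.4])
churn = shaped_reward(0.001, 0.05, [0.6, 0.4], [0.4, 0.6])
print(hold > churn)
```

With only the drawdown term, holding and churning score the same here. The cost term is what separates them, and that is the interaction effect in miniature.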
Why distributional RL?
The model doesn't predict an *expected* return.
It predicts a *distribution* of returns.
This makes it natively aware of tail risk - the worst-case scenarios get weighted in decision-making.
Standard RL maximizes the mean return. Distributional RL models the whole return distribution, so the tails stay visible.
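A toy sketch of the difference, assuming the model outputs equally weighted quantile estimates of return per action (as quantile-based distributional methods do). The thread doesn't say which algorithm or risk measure the system uses, so CVaR, the action names, and the numbers below are all illustrative assumptions.

```python
def mean_value(quantiles):
    """Risk-neutral score: the average of the quantile estimates."""
    return sum(quantiles) / len(quantiles)

def cvar(quantiles, alpha=0.25):
    """Tail-aware score: the average of the worst alpha fraction of quantiles."""
    k = max(1, int(len(quantiles) * alpha))
    return sum(sorted(quantiles)[:k]) / k

# Hypothetical return quantiles for two candidate actions.
actions = {
    "aggressive": [-0.40, -0.10, 0.05, 0.10, 0.15, 0.20, 0.25, 0.35],
    "defensive":  [-0.08, -0.03, 0.00, 0.02, 0.04, 0.06, 0.08, 0.10],
}

by_mean = max(actions, key=lambda a: mean_value(actions[a]))
by_cvar = max(actions, key=lambda a: cvar(actions[a]))
print(by_mean, by_cvar)
```

Ranking by mean picks the aggressive action. Ranking by the worst quarter of outcomes picks the defensive one. That second ranking needs the full distribution, which a mean-only value estimate can't supply.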
The benchmark I set:
Bridgewater All Weather lost ~22% in Lehman. Most mutual funds lost 50%. SPY lost 55%.
That 22% number became my target.
(Spoiler: V1 matched it. V3 is training to beat it.)
Most retail quants try to beat the market. I tried to *survive* it. Asymmetric math:
• 50% loss needs +100% to recover
• 30% loss needs +43%
• 20% loss needs +25%
The best defensive systems don't chase alpha. They preserve compounding.
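The numbers above come from one formula: after losing a fraction L, you need a gain of 1 / (1 - L) - 1 to get back to even. A quick sketch (the function name is mine):

```python
def recovery_gain(loss: float) -> float:
    """Fractional gain needed to return to the starting value after a fractional loss."""
    if not 0.0 <= loss < 1.0:
        raise ValueError("loss must be in [0, 1)")
    return 1.0 / (1.0 - loss) - 1.0

for loss in (0.50, 0.30, 0.20):
    print(f"{loss:.0%} loss needs +{recovery_gain(loss):.0%}")
```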
Bridgewater All Weather lost 22% during the 2008 crash.
A retail system I built — running on a single RTX 3060 — matched that number on the same in-sample replay.
Tomorrow 14:00 CET: how I did it, what's broken, what I'm fixing in V3.
The engineering thread, not the pitch.
Stop optimizing for Sharpe.
Sharpe penalizes upside volatility just as much as downside volatility.
Your "perfect Sharpe 2" strategy might be one that never lets you win.
Defensive systems should optimize for Calmar (return / max drawdown) or Ulcer Performance Index.
The metric you choose IS the strategy.
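A minimal sketch of the metrics, using their standard textbook definitions, not code from the system. It assumes a zero risk-free rate and 252 trading days per year, and the two return series are made up. The Ulcer Performance Index divides excess return by the Ulcer Index computed here.

```python
import math

def equity_curve(returns, start=1.0):
    """Compound a list of per-period returns into an equity curve."""
    curve = [start]
    for r in returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate (assumption)."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(equity):
    """Largest fractional drop from a running peak."""
    peak, worst = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

def ulcer_index(equity):
    """Root-mean-square of percentage drawdowns from the running peak."""
    peak, total = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        total += (100.0 * (peak - v) / peak) ** 2
    return math.sqrt(total / len(equity))

def calmar(equity, periods_per_year=252):
    """Annualized return divided by max drawdown."""
    years = (len(equity) - 1) / periods_per_year
    cagr = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0
    mdd = max_drawdown(equity)
    return cagr / mdd if mdd > 0 else float("inf")

# Two made-up series with no losing days; "spiky" adds two big up days.
steady = [0.005, 0.015] * 10
spiky = list(steady)
spiky[5] = 0.08
spiky[15] = 0.08

print(sharpe(steady) > sharpe(spiky))                        # Sharpe prefers steady
print(equity_curve(spiky)[-1] > equity_curve(steady)[-1])    # spiky ends richer
print(max_drawdown(equity_curve(spiky)))                     # no drawdown either way
```

The spiky series is never worse than the steady one on any day, yet it scores a lower Sharpe because the two big up days inflate volatility. Drawdown-based metrics don't penalize it, since neither series ever falls below its peak.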
Oct 9, 2007: S&P 500 hits all-time high.
Mar 9, 2009: S&P 500 down 55%.
A defensive RL system I built lost 22% in that same window.
I wrote 3,800 words about how — and what I learned trying to fix the slow-bear weakness 14 years later.
https://t.co/jlVmBY6374