a bit later than planned, but as promised: we’ve now released the data, traces, plots, and code behind our SpreadsheetBench / RLM-GEPA experiment.
everything is in the predict-rlm repo now.
https://t.co/ebTDlzCsZj (see examples/spreadbench)
@akshay_pachaar Great summary, Akshay! For those who are curious, we have a production-focused, batteries-included RLM distribution (soon shipping with the GEPA optimizer) https://t.co/yGdTU8STfn
GEPA learnings: optimize instructions, freeze facts
A few months ago I bumped into GEPA. I was overjoyed because it helped me tag close to a million PDF pages accurately with a low-cost LLM. So I pushed the dataset to psxGPT, a version worked, and it got to 1,000 users.
But then my serious users started asking for structured data and the ability to model more easily. I added those and hand-crafted some rules to handle routing complexity. It worked for a while, but reliability sucked, so I reverted to an older commit.
As I started debugging, re-creating my backend in a single Jupyter notebook, and re-running evals, the gaps became clearer. Signature boundaries weren’t tightly defined. As I closed those, one question remained:
What should GEPA optimize and what should it leave alone?
Where I ended up: freeze facts, optimize all else.
Previously I never touched the Signature docstring. But I noticed that letting GEPA change the docstring is super effective (DSPy Signatures have a little docstring instruction in addition to inputs and outputs). For me the what vs. how distinction isn’t helpful. Fact vs. instruction is clearer.
So the schema is fact, column names are fact, the ticker set is fact, the list of canonical names is fact.
Everything else is instruction and therefore should be optimized. This has also led me to design my context folder by Signature, split into schema.md (which I don’t optimize) and a bunch of other files, which I do.
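Here is a rough sketch of how that split could look in plain Python. It is not DSPy or GEPA API: `SignatureContext`, `apply_candidate`, and every file name other than schema.md are hypothetical, made up for illustration. The idea is just that an optimizer's proposed text gets merged into the instruction files, and any attempt to rewrite a fact file is rejected.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping

# Hypothetical: the set of fact files that the optimizer must never rewrite.
FROZEN_FILES = frozenset({"schema.md"})


@dataclass(frozen=True)
class SignatureContext:
    """One context folder per Signature: file name -> file text."""

    name: str
    files: Mapping[str, str]

    def facts(self) -> dict[str, str]:
        """Files that stay fixed (schema, column names, ticker set, ...)."""
        return {k: v for k, v in self.files.items() if k in FROZEN_FILES}

    def instructions(self) -> dict[str, str]:
        """Files the optimizer is allowed to rewrite."""
        return {k: v for k, v in self.files.items() if k not in FROZEN_FILES}


def apply_candidate(ctx: SignatureContext, candidate: Mapping[str, str]) -> SignatureContext:
    """Return a new context with optimizer-proposed instruction text applied.

    Raises ValueError if the candidate rewrites a frozen fact file or
    names a file that the folder does not contain.
    """
    touched_facts = sorted(set(candidate) & FROZEN_FILES)
    if touched_facts:
        raise ValueError(f"candidate rewrites frozen files: {touched_facts}")
    unknown = sorted(set(candidate) - set(ctx.files))
    if unknown:
        raise ValueError(f"candidate names unknown files: {unknown}")
    return SignatureContext(ctx.name, {**ctx.files, **candidate})


# Illustrative folder contents; all text here is invented for the example.
ctx = SignatureContext(
    name="extract_balance_sheet",
    files={
        "schema.md": "columns: ticker, period, line_item, value",
        "docstring.md": "Extract balance sheet rows from the page.",
        "routing.md": "Send bank filings to the bank-specific stage.",
    },
)
updated = apply_candidate(
    ctx, {"docstring.md": "Extract every balance sheet row; keep units as printed."}
)
```

The guard sits in one place, so whichever optimizer proposes the text, schema.md comes out unchanged.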
This process has raised another important question:
When is the right time to run GEPA?
Only when signature boundaries are super tight and you’re very sure no obvious instruction is left out. Meaning whatever error is arising is just model drift and not a function of your own instruction sloppiness.
So for instance: if a stage requires inserting the canonical names of a bank’s balance sheet and you didn’t insert those, only to see the model getting lost, fix that first.
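A cheap way to catch that kind of gap before spending an optimizer run is a pre-flight check on each stage's prompt. This is a minimal sketch under my own assumptions: `missing_facts` is a hypothetical helper, and the canonical names and prompt text below are invented placeholders, not real data.

```python
from __future__ import annotations


def missing_facts(prompt: str, required: list[str]) -> list[str]:
    """Return the required fact strings that never appear in the prompt.

    Matching is a plain case-insensitive substring test, which is crude
    but enough to catch a fact that was never inserted at all.
    """
    haystack = prompt.casefold()
    return [fact for fact in required if fact.casefold() not in haystack]


# Placeholder canonical names and stage prompt, for illustration only.
canonical_names = ["Total Assets", "Deposits", "Advances", "Total Equity"]
stage_prompt = "Map each row to one of: Total Assets, Deposits, Total Equity."

gaps = missing_facts(stage_prompt, canonical_names)
if gaps:
    print("fix before running GEPA:", gaps)
```

If the list comes back non-empty, the error is your own instruction sloppiness, not model drift, so the fix belongs in the prompt rather than in an optimizer run.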
Writing the spec, writing tests, defining signature boundaries, and manually checking code blocks in a Jupyter notebook is tedious, yes, but essential. I like to rebuild until the cognitive load vanishes. That’s when I know that I actually “know”. Otherwise I’m winging it, and that feels unsettling.
GEPA now has an API which is very easy to use. I prefer it over using the DSPy library. There are some similar optimizers out there too. I don’t understand the underlying genetic Pareto stuff well enough, but this API works very well for me. Swapping DeepSeek V3.2 for Gemini 3.1 Flash Lite was a breeze.
indeed it's all just signatures (specs), modules ("harnesses", "inference scaling"), and optimizers (learning algorithms for prompts, weights, and hyperparameters)
can you imagine what would happen if there was a framework that divvied this up in 2022 and is still growing?
ok so the default DSPy.RLM is literally going to destroy this benchmark before the end of the day.
running now for sonnet 4.5...
🏆 Scoreboard (live)
RLM: 90/94 (95.7%)
Vanilla: 0/94 (0.0%)
anyone want to pay for the opus run? 😉
Launching, Open-source:
Self-Harness Recursive Language Model leveraging GEPA + @DSPyOSS
Call autonomous agents like a function.
Run your business like software.