Jean Michel A. Sarr | PhD CS
Research Engineer @ Google (views my own)
Building Distill in public: research memory, synthesis, and decision support
Accra, Ghana
Reading more wasn’t my problem. Remembering and compounding insight was.
I wrote about why I kept restarting things I cared about, and why I’m building Distill: a system that tracks papers, remembers what matters, and updates conclusions as new evidence arrives.
Building it in public:
https://t.co/Cl4sBf6r36
@jeremyphoward @Zai_org For the first time I went on https://t.co/Cxy28YRu25. I asked if the model had a laptop app similar to Claude Code or Codex, and the model started talking about Gemini CLI. When I asked it who it was, it thought it was Gemini, while the app showed GLM 4.7. Not great.
Inspiring article, the part about choosing your own problems particularly resonated with me. Big tech has a dramatic influence in the AI research landscape because they often dictate which problems are worth solving. But there are many cool problems out there that are not popular yet.
The topic I'm tracking will change. Maybe in 3 months, maybe 6.
So the topic definition — thesis, scoring dimensions, source priorities, audience profile — lives in a single config file instead. The orchestration code knows nothing about what it's tracking. Pivoting to a new research area means swapping the file.
Zero code changes to go from "data advantages in AI" to any other domain.
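A minimal sketch of that separation, assuming a JSON config. The field names mirror the four items listed above, but the schema, the `TopicConfig` class, and the `build_scoring_prompt` helper are hypothetical illustrations, not Distill's actual code. The config is inlined as a string here only to keep the example self-contained; in practice it would be read from the file.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicConfig:
    """Everything topic-specific lives here (hypothetical schema)."""
    thesis: str
    scoring_dimensions: list[str]
    source_priorities: dict[str, int]
    audience_profile: str


def load_topic(raw_json: str) -> TopicConfig:
    # The orchestration code only ever sees a TopicConfig, never a hard-coded topic.
    return TopicConfig(**json.loads(raw_json))


def build_scoring_prompt(topic: TopicConfig, abstract: str) -> str:
    # Topic-agnostic: every topic-specific string comes from the config.
    dims = ", ".join(topic.scoring_dimensions)
    return (
        f"Thesis: {topic.thesis}\n"
        f"Audience: {topic.audience_profile}\n"
        f"Score this abstract on: {dims}\n"
        f"Abstract: {abstract}"
    )


# Illustrative values only; a real deployment would load this from disk.
TOPIC_JSON = """
{
  "thesis": "Data advantages in AI",
  "scoring_dimensions": ["relevance", "novelty", "evidence_quality"],
  "source_priorities": {"arxiv": 1, "blogs": 2},
  "audience_profile": "ML researchers"
}
"""

topic = load_topic(TOPIC_JSON)
prompt = build_scoring_prompt(topic, "We study synthetic data for alignment.")
```

Pivoting then means passing a different JSON document to `load_topic`; nothing in `build_scoring_prompt` changes.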
@lemire If the problem is the attitude of the contributors, then it surely wouldn't solve it. But if the problem is that code review is the bottleneck, then it would probably help.
Distill continuously ingests papers and scores each one for relevance.
The obvious approach: fetch everything, then decide. But that means paying to read every paper you'll throw away.
Instead, a cheap model runs on abstracts only — no full-text fetch. Anything below the relevance threshold gets dropped there. Full-text scoring only runs on what cleared the gate.
Reading is the most expensive step in the pipeline. Every architectural decision around it should treat it that way.
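The gate described above can be sketched as a two-stage filter. This is an illustration under stated assumptions, not Distill's implementation: `Paper`, `triage`, the scorer callables, and the 0.5 threshold are hypothetical names and values, and the toy keyword scorer stands in for the cheap model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Paper:
    paper_id: str
    abstract: str


def triage(
    papers: Iterable[Paper],
    cheap_score: Callable[[str], float],
    fetch_full_text: Callable[[str], str],
    full_score: Callable[[str], float],
    threshold: float = 0.5,  # assumed value, tune per topic
) -> dict[str, float]:
    """Score abstracts first; fetch and score full text only for survivors."""
    results: dict[str, float] = {}
    for paper in papers:
        # Stage 1: cheap pass on the abstract only, no full-text fetch.
        if cheap_score(paper.abstract) < threshold:
            continue  # dropped at the gate, nothing expensive was paid for
        # Stage 2: the expensive read happens only past the gate.
        results[paper.paper_id] = full_score(fetch_full_text(paper.paper_id))
    return results


# Toy stand-ins so the sketch runs end to end.
KEYWORDS = ("synthetic", "alignment")
fetched: list[str] = []


def toy_cheap_score(abstract: str) -> float:
    text = abstract.lower()
    return sum(k in text for k in KEYWORDS) / len(KEYWORDS)


def toy_fetch(paper_id: str) -> str:
    fetched.append(paper_id)  # record every expensive fetch
    return f"full text of {paper_id}"


def toy_full_score(text: str) -> float:
    return float(len(text))


papers = [
    Paper("p1", "Synthetic data for alignment"),
    Paper("p2", "A study of bird migration"),
]
scores = triage(papers, toy_cheap_score, toy_fetch, toy_full_score)
```

In this toy run only `p1` clears the gate, so `toy_fetch` is called once rather than twice.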
In preference learning, who judges quality and how those judgments update the policy are two distinct decisions that people often mix together.
• Human-written principles (e.g., Constitutional AI) provide an interpretable judging mechanism, where explicit rules guide the model in labeling responses before those labels are used to train a reward model.
• Expert model judges such as GPT-4 generate preference labels that can either train a reward model for RL or feed directly into DPO for policy optimization.
• Self-judgment allows the model to prefer one response over the other, either by relying on emergent judging ability or by leveraging explicit judge training, which has been shown to outperform the emergent approach.
• Hybrid methods combine multiple sources of judgments, such as Constitutional AI mixing AI-labeled harmlessness with human-labeled helpfulness to balance safety and utility.
Decoupling who judges from how those judgments update the policy gives you orthogonal control knobs over two fundamentally different parts of the system.
Have you found any other paradigms?
https://t.co/EFG2GXYjSE
[12/n]
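One way to picture the two knobs from the thread above is as separate pieces of code: a swappable judge that produces preference labels, and an update rule that consumes them. The sketch below is an assumption-laden toy, not a method from the linked series. `rule_judge` and `length_judge` are hypothetical stand-ins for a principle-based judge and an expert-model judge, and `dpo_loss` is the standard per-pair DPO objective written in plain Python.

```python
import math
from typing import Callable, NamedTuple


class Preference(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


# Knob 1: who judges. A judge returns True if the first response is preferred.
Judge = Callable[[str, str, str], bool]


def rule_judge(prompt: str, a: str, b: str) -> bool:
    # Toy stand-in for a written principle: avoid insulting language.
    # Ties (both clean or both insulting) go to the first response.
    return "stupid" not in a.lower() or "stupid" in b.lower()


def length_judge(prompt: str, a: str, b: str) -> bool:
    # Toy stand-in for a model judge: prefer the longer response.
    return len(a) >= len(b)


def label(judge: Judge, prompt: str, a: str, b: str) -> Preference:
    if judge(prompt, a, b):
        return Preference(prompt, a, b)
    return Preference(prompt, b, a)


# Knob 2: how judgments update the policy. Here, the DPO loss for one pair,
# given log-probabilities under the policy and a frozen reference model.
def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


pref = label(rule_judge, "q", "That is a stupid question.", "Here is an answer.")
```

Replacing `rule_judge` with `length_judge` changes the labels without touching `dpo_loss`, and the same `Preference` tuples could instead feed a reward model.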
RLHF can't scale. Here's why 🧵
I just published a 4-part research series digging into its fundamental limits and mapping the synthetic alignment methods taking over. Starting an n-day daily thread walking through the evidence, one insight at a time.
Join me? Day 1/n: full roadmap
https://t.co/af6LyQOYIp
In some cases, you want to refine responses to generate natural preference pairs. How do you do that?
You can:
- Use heuristics (e.g., bigger models produce better responses).
- In an online setting where you continue training the reward model on new data, sample preference pairs from the current reward model.
- With DPO, use the policy itself as a reward model to directly rank its own answers.
- Use Constitutional AI: generate a critique of a bad response, then apply a human-written constitution to revise it.
- Use self-play: let the model engage with itself in multi-turn conversation and select the latest refined answer.
- Use tree search: generate multiple responses, select the best, critique it, and generate improved ones until satisfied.
Have you used any of these methods before?
https://t.co/LIBbQans20
[11/n]
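The tree-search option in the list above (generate several responses, select the best, critique it, generate improved ones until satisfied) can be sketched as a small loop that returns a refined and original pair. Every name, signature, and default here is hypothetical, and the toy callables stand in for real model calls.

```python
from typing import Callable, Optional

Generate = Callable[[str, int], list[str]]
Score = Callable[[str], float]
Critique = Callable[[str, str], str]
Revise = Callable[[str, str, str, int], list[str]]


def refine_to_pair(
    prompt: str,
    generate: Generate,
    score: Score,
    critique: Critique,
    revise: Revise,
    n_candidates: int = 3,
    max_rounds: int = 3,
    target: float = 0.9,  # assumed "satisfied" threshold
) -> Optional[tuple[str, str]]:
    """Return (chosen, rejected), or None if refinement never improved."""
    best = max(generate(prompt, n_candidates), key=score)
    original = best
    for _ in range(max_rounds):
        if score(best) >= target:
            break  # good enough, stop refining
        feedback = critique(prompt, best)
        challenger = max(revise(prompt, best, feedback, n_candidates), key=score)
        if score(challenger) <= score(best):
            break  # no improvement this round
        best = challenger
    if best == original:
        return None
    return best, original  # refined answer preferred over the starting one


# Toy stand-ins: longer strings score higher, and revision appends characters.
def toy_generate(prompt: str, n: int) -> list[str]:
    return [prompt + "!" * i for i in range(n)]


def toy_score(response: str) -> float:
    return min(len(response) / 10, 1.0)


def toy_critique(prompt: str, response: str) -> str:
    return "too short"


def toy_revise(prompt: str, response: str, feedback: str, n: int) -> list[str]:
    return [response + "." * (i + 1) for i in range(n)]


pair = refine_to_pair("hi", toy_generate, toy_score, toy_critique, toy_revise)
```

The returned tuple is a natural preference pair: the refined answer as chosen and the best initial candidate as rejected.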
I just arrived in San Diego for NeurIPS! I am excited to discuss the latest research in synthetic data, alignment, and more. Looking forward to meeting new folks and discovering the local food!