RL Systems Mind the Gap:
Matching Trainer and Generator Throughput
RL Training Infrastructure, GRPO,
PipelineRL, Async RL, Policy Staleness,
RL Sandbox Infra, CPU Requirements,
TCO Analysis, Thinking Machines Tinker
https://t.co/yr5oH99h4B
Been thinking recently about how to improve credit assignment in long-horizon RL?
Our new MosaicLeaks blog post describes our method to accurately value actions via situational rewards, improving our privacy-aware research agent over outcome-only rewards!
https://t.co/xCbOKEcdwa
@eisokant @eliebakouch Base models are amazing for RL tuning and reasoning. If a company is looking to shape some behaviour into their models I would expect them to start with a base stage.
Excited to share my recent work @ServiceNowRSRCH ! We introduce a new privacy-centric deep research dataset and show models frequently leak enterprise information.
However, training with dense _situational_ rewards efficiently learns to jointly optimize performance and privacy.
MosaicLeaks is now on arXiv. The Mosaic Effect captures a simple idea: small fragments can look harmless alone, but become revealing in aggregate. Deep research agents can leak enterprise information in exactly this way.
1/9
The core idea: Enterprise agent privacy failures will not only come from copying private text. They can also come from the external actions agents take while trying to be useful. Privacy shouldn't come at the cost of utility; we can optimise for both.
8/9
Better reasoning does not have to mean longer reasoning.
Apriel OpenReasoner: fully reproducible multi-domain RL post-training using public datasets. 30-50% shorter traces, no quality trade-off.
@ServiceNowRSRCH @ehsk0 @dvazquezcv @alexandredrouin