John deVadoss

@john_devadoss

co-Founder NeuralFabric acq. by @Cisco | co-Founder @IntWorkAll | Board @GBBC_io | General Manager @Microsoft | PhD RL research @UMassAmherst

Joined June 2019

2K Following

9.6K Followers

6 Posts

Pinned Tweet

John deVadoss

@john_devadoss

9 months ago

A Public AI Wealth fund, not 'basic income'. It is time for Congress to act. https://t.co/t40f66tO4S

61K

john_devadoss retweeted

Ryan Bahlous-Boldi

@RyanBoldi

14 days ago

Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test-time search performance, even on the original scalar.

847

119

784

210K

john_devadoss retweeted

Souradip Chakraborty

@SOURADIPCHAKR18

21 days ago

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it can stumble upon with compute? ⤵️ Pedagogical RL