Paco Guzmán

@guzmanhe

Head of Research @Handshake

San Francisco, CA

Joined March 2009

172 Following

358 Followers

428 Posts

guzmanhe retweeted

Jonas Mueller

@jomulr

9 days ago

Packed room to hear @alexgshaw and @ryanmart3n break down how @harborframework grew into *the* framework for RL environments. In our RLEval workshop at @CAISconf today, attendees tackled big open challenges in RLEs & Agent Evals + I shared the approach we take at @joinHandshake

jomulr's tweet photo. Packed room to hear @alexgshaw and @ryanmart3n break down how @harborframework grew into *the* framework for RL environments.

In our RLEval workshop at @CAISconf today, attendees tackled big open challenges in RLEs & Agent Evals + I shared the approach we take at @joinHandshake https://t.co/dfRbcJafg3

guzmanhe retweeted

Curtis G. Northcutt

@cgnorthcutt

9 days ago

Kudos to @anishathalye and @jomulr for co-chairing the RL agentic benchmarks workshop track for the inaugural ACM CAIS conference this week. We presented two separate Handshake AI Research papers in: (1) AI agentic systems - first evaluation of grader frameworks, and (2) AI benchmarks - first investment banking benchmark. Their posters had big crowds all afternoon. Great job!

cgnorthcutt's tweet photo. Kudos to @anishathalye and @jomulr for co-chairing the RL agentic benchmarks workshop track for the inaugural ACM CAIS conference this week.

We presented two separate Handshake AI Research papers in: (1) AI agentic systems - first evaluation of grader frameworks, and (2) AI benchmarks - first investment banking benchmark. Their posters had big crowds all afternoon. Great job!

Paco Guzmán

@guzmanhe

about 1 month ago

@jomulr @adithya_s_k exciting event! when are the keynote speakers going to be anounced?

guzmanhe retweeted

Jonas Mueller

@jomulr

about 1 month ago

@adithya_s_k Super useful resource, thank you for putting it together! Researchers working on RLE design & Agent Evals might consider submitting papers / attending the first-ever Workshop in this area at the upcoming ACM Conference on AI and Agentic Systems: https://t.co/FpWtMJsnv1

jomulr's tweet photo. @adithya_s_k Super useful resource, thank you for putting it together!

Researchers working on RLE design & Agent Evals might consider submitting papers / attending the first-ever Workshop in this area at the upcoming ACM Conference on AI and Agentic Systems:

https://t.co/FpWtMJsnv1 https://t.co/CBvZFc8rVP

Who to follow

thamar |

@thamar_solorio

Vice Provost, Faculty Excellence & NLP Prof @MBZUAI, Director @RiTUAL_Lab. Learner, friend, mother, partner, loves sunny days and live music. Views are mine.

Simone Scardapane

@s_scardapane

I fall in love with a new #machinelearning topic every month 🙄 | Researcher @SapienzaRoma | Author: Alice in a diff wonderland https://t.co/A2rr19d3Nl

IWSLT

@iwslt

The International Conference on Spoken Language Translation & SIGSLT. Join us for the 22nd edition of IWSLT on 31 July-1 Aug 2025, co-located with ACL!

guzmanhe retweeted

NAACL @naacl

about 1 month ago

Due May 6! NAACL RAF is accepting proposals for 2026-2027! Grants are available for NLP/CompLing initiatives across the Americas. https://t.co/T4a64PkYx2

guzmanhe retweeted

Aran Komatsuzaki

@arankomatsuzaki

about 1 month ago

The non-English tax is real. Sutton's Bitter Lesson, translated across languages and normalized to OpenAI English token count: Hindi: OpenAI 1.37×, Anthropic 3.24× Arabic: OpenAI 1.31×, Anthropic 2.86× Chinese: OpenAI 1.15×, Anthropic 1.71× Claude’s tokenizer charges a much higher linguistic tax.

arankomatsuzaki's tweet photo. The non-English tax is real.

Sutton's Bitter Lesson, translated across languages and normalized to OpenAI English token count:

Hindi: OpenAI 1.37×, Anthropic 3.24×
Arabic: OpenAI 1.31×, Anthropic 2.86×
Chinese: OpenAI 1.15×, Anthropic 1.71×

Claude’s tokenizer charges a much higher linguistic tax.

261

738

861K

guzmanhe retweeted

FleetingBits

@fleetingbits

about 1 month ago

some thoughts on the business of fine-tuning 1) thinking machines acquired workshop labs, which was doing llm personalization via fine-tuning 2) thinking machines probably acquired workshop labs to build out an application level product on top of their fine tuning apis 3) it's not quite clear whether this product is intended to be b2b or b2c; workshop labs was agnostic; thinking feels b2b so probably b2b 4) anyway, b2b fine-tuning has always been a hard market and no one has really seen success in it yet despite many attempts 5) the first issue is that fine-tuning tends to provide quickly depreciating value even if you get it right 6) if you use open models then you are already behind the frontier and so your fine tuning has to be able to bridge and then exceed that capability gap 7) and, once you do your fine-tuning, a new better model will be released soon anyway, so to retain any benefit you need a whole fine-tune and deploy cycle 8) and, this is assuming that the new model just doesn't already have all the capabilities of your data, such that the fine tuning still provides an additional advantage 9) the second issue is that the value of fine-tuning is expensive to capture in the first place; it is hard to collect data and do deployments 10) organizations are not designed to mine their own task data and create benchmarks; this requires special expertise companies don't tend to have 11) i also believe that this is process is likely to be organizationally difficult for a company to manage and work through; since it involves collaboration between different teams 12) also, if you want to use multiple different fine-tuned models across your application, you have just made that application much more complicated to update 13) so forward deployment seems very important to support enterprise fine-tuning in high value industries like drug development, semiconductors, finance, etc... 14) because you need to help solve their data collection, benchmark creation, model deployment and organizational difficulties in order to help them get value 15) so, my feeling is that anyone that wants to work on this should focus on high value, specialized industries, that the labs will not commoditize and which can afford forward deployment 16) and, you need to develop very comprehensive forward deployment that helps the customer ideate on problems, solve organization difficulties, and do process mining to collect the data 17) if you can solve this for very large scientific and financial customers, you can probably begin to build automated agents that would help extend this to a wider range of specialized enterprise customers 18) anyway, this would be my advice to thinky or anyone that wants to do enterprise fine tuning, develop a forward deployed team, with a set of enabling software solutions for process mining, rubric creation, etc...

153

158

15K

Paco Guzmán

@guzmanhe

about 1 month ago

@_arohan_ Good luck on whatever you’re building next

468

guzmanhe retweeted

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

@teortaxesTex

about 2 months ago

This is surprising to me And it suggests that RL updates are not global enough 30% noise is a hell of a lot

guzmanhe retweeted

Jonas Mueller

@jomulr

about 2 months ago

📢 Call for papers: Workshop on Methods and Reinforcement Learning Environments for Evaluating AI Agents @ ACM CAIS 2026 (inaugural edition!) Topics include: - Design principles for effective RL Environments - Methods to evaluate Agents, esp. causal/interventional techniques

jomulr's tweet photo. 📢 Call for papers: Workshop on Methods and Reinforcement Learning Environments for Evaluating AI Agents @ ACM CAIS 2026 (inaugural edition!)

Topics include:
- Design principles for effective RL Environments
- Methods to evaluate Agents, esp. causal/interventional techniques https://t.co/Ixx3UAusAA

guzmanhe retweeted

Handshake @joinHandshake

about 2 months ago

AI models are incredible at coding and math. Labs like OpenAI and Anthropic solve verifiable domains by teaching models with tasks that have clear right or wrong answers, like "5/2." But in domains like finance or law, there is rarely a single right answer. There, labs turn to verifiers, complex systems that use AI, to grade the answers. But these verifiers can make mistakes! Is that an issue? In our latest research, we show that the verifier can be wrong 15–30% of the time, and the models will learn just as well. This means we can use these imperfect verifiers without losing performance!

Paco Guzmán

@guzmanhe

about 2 months ago

Full details in our arXiv preprint: https://t.co/O2enMkiQyt.

Paco Guzmán

@guzmanhe

about 2 months ago

An Imperfect Verifier is Good Enough One of the biggest bottlenecks in RLVR is building highly accurate verifiers. We asked: How good does a verifier actually need to be? Turns out… not perfect. Up to 15–30% noise in the reward signal barely hurts training, staying within 4pp of a clean baseline across three model families and two domains (code generation + scientific reasoning).

guzmanhe's tweet photo. An Imperfect Verifier is Good Enough

One of the biggest bottlenecks in RLVR is building highly accurate verifiers. We asked: How good does a verifier actually need to be?

Turns out… not perfect. Up to 15–30% noise in the reward signal barely hurts training, staying within 4pp of a clean baseline across three model families and two domains (code generation + scientific reasoning).

100

guzmanhe retweeted

Anish Athalye

@anishathalye

about 2 months ago

Does an imperfect verifier break reinforcement learning with verifiable rewards (RLVR)? Turns out it doesn’t! Why does this matter? As the world moves into reinforcement learning in semi-verifiable domains, perfect verifiers don’t exist. We added controlled and LLM-based noise to RLVR reward signals and found that up to 30% noise barely hurts training; performance stays within 4pp of the clean baseline. This research has already impacted how we build reinforcement learning environments at @joinHandshake. For a major benchmark we are launching tomorrow, we hill-climbed the verifier to 88% accuracy—above the 85% human inter-rater agreement—knowing from this research that this is good enough. With @andreas_plesner @guzmanhe

anishathalye's tweet photo. Does an imperfect verifier break reinforcement learning with verifiable rewards (RLVR)? Turns out it doesn’t!

Why does this matter? As the world moves into reinforcement learning in semi-verifiable domains, perfect verifiers don’t exist.

We added controlled and LLM-based noise to RLVR reward signals and found that up to 30% noise barely hurts training; performance stays within 4pp of the clean baseline.

This research has already impacted how we build reinforcement learning environments at @joinHandshake. For a major benchmark we are launching tomorrow, we hill-climbed the verifier to 88% accuracy—above the 85% human inter-rater agreement—knowing from this research that this is good enough.

With @andreas_plesner @guzmanhe

186

141

28K

Paco Guzmán

@guzmanhe

2 months ago

📷 Tuesday, April 8th @ 6 PM PT 📷 Shack 15 | Ferry Building, San Francisco Space is limited. If you'd like to attend, please RSVP here. https://t.co/wT3XNbfBZO

Paco Guzmán

@guzmanhe

2 months ago

Featured speakers: Andi Peng: Co-founder, Humans& Patrick Tammer: AI Data & Strategy Lead, Google Linda Lu: Responsible AI Researcher, Berkeley RDI

112

Paco Guzmán

@guzmanhe

2 months ago

The panel will dig into: How to benchmark AI's ability to perform economically meaningful work Current technical ceilings on agent capabilities Whether reasoning advances in agentic coding will transfer to high-stakes domains like finance and law

106

Paco Guzmán

@guzmanhe

2 months ago

Before ICLR, we're hosting an invitation-only research symposium in San Francisco focused on High GDP Agents: LLM and agentic research that measurably generates significant global economic value. 📷 Tuesday, April 8th @ 6 PM PT

204

Paco Guzmán

@guzmanhe

4 months ago

@anishathalye @CleanlabAI @joinHandshake Welcome!!

Paco Guzmán

@guzmanhe

4 months ago

@jomulr @CleanlabAI @joinHandshake So glad to have you on the team

Paco Guzmán

@guzmanhe

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users