Kyle Montgomery

@kylepmont

PhD student at UC Santa Cruz

Santa Cruz, CA

Joined January 2016

41 Following

54 Followers

22 Posts

kylepmont retweeted

Chenguang Wang (hiring)

@ChenguangWang

5 days ago

Honored and grateful to receive the NSF CAREER Award! Above all, I want to thank my family for their unwavering love and encouragement, and my students for their creativity and dedication. This recognition would not be possible without them. Thank you to my mentors, collaborators, and the National Science Foundation for their support and trust. I’m excited for what is ahead and thankful to everyone who has been part of this journey. #NSFCAREER

700

kylepmont retweeted

Chenguang Wang (hiring)

@ChenguangWang

about 1 month ago

🚨Last call for papers! The extended submission deadline for the Second Workshop on Agents in the Wild: Safety, Security, and Beyond at ICML 2026 is now May 8, 2026, AoE for both regular and short paper tracks! 🌟 We’re excited to host outstanding speakers, including @Yoshua_Bengio (@Mila_Quebec, @LawZero_, @UMontreal), @dawnsongtweets (@UCBerkeley, @BerkeleyRDI), @WeiWang1973 (@UCLA), @bemikelive (@databricks), @JiantaoJ (@nvidia, @UCBerkeley), @Eric_Wallace_ (@OpenAI), @astonzhangAZ (@OpenAI), and @ce_zhang (@togethercompute, @UChicago).

ChenguangWang's tweet photo. 🚨Last call for papers!

The extended submission deadline for the Second Workshop on Agents in the Wild: Safety, Security, and Beyond at ICML 2026 is now May 8, 2026, AoE for both regular and short paper tracks!

🌟 We’re excited to host outstanding speakers, including @Yoshua_Bengio (@Mila_Quebec, @LawZero_, @UMontreal), @dawnsongtweets (@UCBerkeley, @BerkeleyRDI), @WeiWang1973 (@UCLA), @bemikelive (@databricks), @JiantaoJ (@nvidia, @UCBerkeley), @Eric_Wallace_ (@OpenAI), @astonzhangAZ (@OpenAI), and @ce_zhang (@togethercompute, @UChicago).

kylepmont retweeted

Chenguang Wang (hiring)

@ChenguangWang

about 2 months ago

🚨Excited to announce Second Workshop on Agents in the Wild: Safety, Security, and Beyond, our workshop at ICML 2026 (@icmlconf) (July 10 or 11, Seoul)! AI agents are rapidly deployed in the real world (from OpenAI's Operator to open-source tools like OpenClaw) - yet safety & security research still lags behind. This workshop aims to address emerging challenges following our first workshop at ICLR 2026, which drew 235 submissions and over 800 anticipated attendees! Submit your work to help shape this field: 🗓️ Submission deadline: May 1 (AoE), for regular or short papers. 👉 https://t.co/XtPtygJn8P.

ChenguangWang's tweet photo. 🚨Excited to announce Second Workshop on Agents in the Wild: Safety, Security, and Beyond, our workshop at ICML 2026 (@icmlconf) (July 10 or 11, Seoul)!

AI agents are rapidly deployed in the real world (from OpenAI's Operator to open-source tools like OpenClaw) - yet safety & security research still lags behind. This workshop aims to address emerging challenges following our first workshop at ICLR 2026, which drew 235 submissions and over 800 anticipated attendees! Submit your work to help shape this field:

🗓️ Submission deadline: May 1 (AoE), for regular or short papers.
👉 https://t.co/XtPtygJn8P.

kylepmont retweeted

rLLM

@rllm_project

2 months ago

Hive’s agent swarm is now topping @OpenAI’s Parameter Golf Challenge 🏆 In just 3 days, our agents pushed val bpb from 1.22 → 1.12. What’s the secret? Not just smarter agents—but collaborative ones. Our swarm doesn’t operate in isolation: agents share breakthroughs, fork the best runs, and continuously evolve together. This is how intelligence compounds. The Hive mind is open and free for anyone to join. Come build, experiment, and evolve with us.

kylepmont retweeted

Sijun Tan @sijun_tan

4 months ago

Excited to collaborate with @SnorkelAI on this project! Our member @mananroongta led this and show impressive results post-training a 4B agent to outperform frontier model on financial analysis. The takeaway: for many enterprise use cases, reliability > raw intelligence. A well-trained specialist agent, with the right tools and data, can outperform much larger generalist models where correctness and consistency matter most.

kylepmont retweeted

Dawn Song

@dawnsongtweets

5 months ago

🚨 Excited to announce Agents in the Wild: Safety, Security, and Beyond, our workshop at ICLR 2026 (Apr 26–27, Rio de Janeiro)! AI agents are rapidly deployed in the real world—but safety & security lag behind. Submit your work to help shape this field: 🗓️ Submission deadline: Feb 4 (AoE), for regular or short papers 👉 https://t.co/IxmDacEUeC

dawnsongtweets's tweet photo. 🚨 Excited to announce Agents in the Wild: Safety, Security, and Beyond, our workshop at ICLR 2026 (Apr 26–27, Rio de Janeiro)!

AI agents are rapidly deployed in the real world—but safety & security lag behind. Submit your work to help shape this field:

🗓️ Submission deadline: Feb 4 (AoE), for regular or short papers
👉 https://t.co/IxmDacEUeC

218

28K

kylepmont retweeted

rLLM

@rllm_project

6 months ago

🚀 We just released rLLM v0.2.1 — packed with several exciting new features! What’s new: - rLLM SDK (preview): Turn your agents written in any frameworks (e.g. LangGraph, Strands) into continuous learners. - Tinker backend: Run serverless RL training with Tinker as the backend. - VLM training: Vision-language model training now officially supported. - LoRA fine-tuning: Enable LoRA in rLLM with a single config tweak. - Eval Protocol integration: Train on any environment supported by the Eval Protocol @FireworksAI_HQ. More examples + docs in the repo: Github: https://t.co/vR6hSBnPja Docs: https://t.co/CfB2pgXnUq

kylepmont retweeted

Nicholas Crispino

@NRCrispino

6 months ago

Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉 Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess. LLMs now reach experts in math & coding, but can they *reason* in dynamic, multi-step strategic environments? We tested 50+ models. The results? Many models struggle to beat an opponent making *random* moves, and even powerful reasoning models cannot beat a *weak skilled opponent*. Why chess? It's been the "drosophila of AI" since the 1950s, used as a measuring stick for AI progress and a testbed for planning, strategy, and long-horizon decision-making. Unlike static benchmarks that get contaminated or saturated, chess offers: ✅ Dynamic, stochastic gameplay ✅ Adjustable difficulty via engine skill ✅ Resistance to memorization Our setup: LLMs play in an agentic environment, making moves through tool calls. **Phase 1:** 50+ models play 30 games each vs a random agent, a simple test that many models *fail* due to instruction-following failures or poor performance. **Phase 2:** Top reasoning models face the Komodo Dragon engine at various Elo scores from 250 to 1375 for performance estimation grounded in the real world (tied to chess. com Elo). Key findings for Phase 1: ♟️ Reasoning models crush non-reasoning: **45.4% vs 0.7%** win rate, with many models struggling to reach even 50% Win/Loss vs a random player ♟️ Instruction failures **3× higher** in non-reasoning models (71.9% vs 24.4%) ♟️ Test-time scaling for reasoning effort boosts performance up to **+20%** Key findings for Phase 2: 📉 The best LLM we tested (o3-low) peaks at only **~758 Elo**. While LLMs match experts in math & coding, they play chess around the average online player (~611 Elo on chess .com) and far below human masters (~2800 Elo). 🔄LLM Chess is extensible. As models improve, we scale difficulty. No saturation, no contamination. Check it out and let us know what you think! We are continually evaluating more models on the benchmark. Come and see us at the FoRLM workshop at 3:00-4:15pm on Sunday December 7th, 2025 @ Upper Level Room 33ABC at NeurIPS! 📄 Paper: https://t.co/NgKlPyzP6h 🏆 Leaderboard: https://t.co/fgSKij6SQd 💻 Code: https://t.co/ngUdQGjLf0 Huge thanks to @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq, @Chi_Wang_, @ChenguangWang 🙏

NRCrispino's tweet photo. Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉

Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess.

LLMs now reach experts in math & coding, but can they *reason* in dynamic, multi-step strategic environments? We tested 50+ models. The results? Many models struggle to beat an opponent making *random* moves, and even powerful reasoning models cannot beat a *weak skilled opponent*.

Why chess? It's been the "drosophila of AI" since the 1950s, used as a measuring stick for AI progress and a testbed for planning, strategy, and long-horizon decision-making.

Unlike static benchmarks that get contaminated or saturated, chess offers:
✅ Dynamic, stochastic gameplay
✅ Adjustable difficulty via engine skill
✅ Resistance to memorization

Our setup: LLMs play in an agentic environment, making moves through tool calls.

**Phase 1:** 50+ models play 30 games each vs a random agent, a simple test that many models *fail* due to instruction-following failures or poor performance.

**Phase 2:** Top reasoning models face the Komodo Dragon engine at various Elo scores from 250 to 1375 for performance estimation grounded in the real world (tied to chess. com Elo).

Key findings for Phase 1:

♟️ Reasoning models crush non-reasoning: **45.4% vs 0.7%** win rate, with many models struggling to reach even 50% Win/Loss vs a random player
♟️ Instruction failures **3× higher** in non-reasoning models (71.9% vs 24.4%)
♟️ Test-time scaling for reasoning effort boosts performance up to **+20%**

Key findings for Phase 2:

📉 The best LLM we tested (o3-low) peaks at only **~758 Elo**.

While LLMs match experts in math & coding, they play chess around the average online player (~611 Elo on chess .com) and far below human masters (~2800 Elo).

🔄LLM Chess is extensible. As models improve, we scale difficulty. No saturation, no contamination.

Check it out and let us know what you think! We are continually evaluating more models on the benchmark.

Come and see us at the FoRLM workshop at 3:00-4:15pm on Sunday December 7th, 2025 @ Upper Level Room 33ABC at NeurIPS!

📄 Paper: https://t.co/NgKlPyzP6h
🏆 Leaderboard: https://t.co/fgSKij6SQd
💻 Code: https://t.co/ngUdQGjLf0

Huge thanks to @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq, @Chi_Wang_, @ChenguangWang 🙏

541

Kyle Montgomery

@kylepmont

8 months ago

Huge thanks to our amazing collaborators @sijun_tan, @YuqiChen74741, @SiyuanZhuang3, @tianjun_zhang, @ralucaadapopa, @ChenguangWang, and the @rllm_project team.

210

Kyle Montgomery

@kylepmont

8 months ago

🚨 New preprint: Budget-aware Test-time Scaling via Discriminative Verification 👉 https://t.co/HOEswBqdmO We show that discriminative verification is the best option for test-time scaling under 25.5 minutes, outperforming state-of-the-art generative verification in both accuracy and efficiency, for example achieving up to +15.3% and +2.8% higher accuracy on AIME2025 at latency budgets of 13.8 min and 15.7 min, respectively. 🧠 Blog: https://t.co/Sw7Hf1FOdt 💻 Code: https://t.co/0xY5XPdTOU 🤗 HuggingFace (data/models): https://t.co/ugkcZL9LcG

kylepmont's tweet photo. 🚨 New preprint: Budget-aware Test-time Scaling via Discriminative Verification 👉 https://t.co/HOEswBqdmO

We show that discriminative verification is the best option for test-time scaling under 25.5 minutes, outperforming state-of-the-art generative verification in both accuracy and efficiency, for example achieving up to +15.3% and +2.8% higher accuracy on AIME2025 at latency budgets of 13.8 min and 15.7 min, respectively.

🧠 Blog: https://t.co/Sw7Hf1FOdt
💻 Code: https://t.co/0xY5XPdTOU
🤗 HuggingFace (data/models): https://t.co/ugkcZL9LcG

Kyle Montgomery

@kylepmont

8 months ago

⏱️ From the latency perspective, the comparison is even more stark. For example, verifying 32 solutions with a 1.5B discriminative verifier is ~1000x faster than generative verification (1.66s vs 1711.8s). Under inference budgets below 22.5 minutes, hybrid discriminative verification outperforms generative verification by up to 15.3% on AIME2025. Discriminative methods avoid decoding bottlenecks and remain practical where generative verification quickly becomes infeasible as the number of solutions of verifications is scaled.

104

Kyle Montgomery

@kylepmont

8 months ago

Thrilled to have been a part of this release — looking forward to what’s coming next with rLLM!

rLLM

@rllm_project

8 months ago

🚀 Introducing rLLM v0.2 - train arbitrary agentic programs with RL, with minimal code changes. Most RL training systems adopt the agent-environment abstraction. But what about complex workflows? Think solver-critique pairs collaborating, or planner agents orchestrating multiple workers. These are hard to express with traditional RL abstractions. v0.2 introduces AgentWorkflowTrainer, built on a simple insight: any agentic flow is just a Python program orchestrating LLM calls, so we made ANY Python program trainable. Researchers and developers can now quickly prototype new ideas or transform their production agentic systems into trainable flows with minimal changes. rLLM now uses official @verl_project ==0.5.0 as our backend (no more custom verl forks!). Just define your agentic workflow or multi-agent system as a Python program and hit train, and rLLM will handle the rest. Since release, rLLM has been adopted to power RL training of world-class agents like @Ali_TongyiLab's DeepResearcher. With this new release, we're working towards building the RL application stack for next-gen agentic AI - where entire systems learn and evolve together, not just individual components in isolation. 📖 Blog post: https://t.co/zmv785DbVo 👨‍💻 GitHub: https://t.co/vR6hSBnPja What agentic program will you train first? 👀

rllm_project's tweet photo. 🚀 Introducing rLLM v0.2 - train arbitrary agentic programs with RL, with minimal code changes.

Most RL training systems adopt the agent-environment abstraction. But what about complex workflows? Think solver-critique pairs collaborating, or planner agents orchestrating multiple workers. These are hard to express with traditional RL abstractions.

v0.2 introduces AgentWorkflowTrainer, built on a simple insight: any agentic flow is just a Python program orchestrating LLM calls, so we made ANY Python program trainable.

Researchers and developers can now quickly prototype new ideas or transform their production agentic systems into trainable flows with minimal changes.

rLLM now uses official @verl_project ==0.5.0 as our backend (no more custom verl forks!). Just define your agentic workflow or multi-agent system as a Python program and hit train, and rLLM will handle the rest.

Since release, rLLM has been adopted to power RL training of world-class agents like @Ali_TongyiLab's DeepResearcher.

With this new release, we're working towards building the RL application stack for next-gen agentic AI - where entire systems learn and evolve together, not just individual components in isolation.

📖 Blog post: https://t.co/zmv785DbVo
👨‍💻 GitHub: https://t.co/vR6hSBnPja

What agentic program will you train first? 👀

140

144

54K

595

Kyle Montgomery

@kylepmont

10 months ago

Swing by the📍KnowFM workshop at #ACL2025 to learn more. Huge thanks 🙏 to David Park, @TuJianhong, @bemikelive, @belizgunel, @dawnsongtweets, and @ChenguangWang.

118

Kyle Montgomery

@kylepmont

10 months ago

Excited to share our latest work at KnowFM at #ACL2025. Predicting Task Performance with Context-aware Scaling Laws models performance on downstream tasks as a function of training compute and context length – ✅ simple, ✅ interpretable, and ✅ effective.

Kyle Montgomery

@kylepmont

10 months ago

Finally, we verify that our fits extrapolate well to out-of-distribution amounts of compute and context, showcasing the usefulness of our method for long-context scaling experiments.

Kyle Montgomery

@kylepmont

about 1 year ago

Excited to share our work at #ICLR2025! JudgeBench ⚖️ tests the reliability of LLM-based judges with a focus on objective correctness. JudgeBench converts tough 🧠 datasets in knowledge, reasoning, math & code into labeled response pairs, forcing objective grading over vibes. Even strong models like GPT-4o barely beat random guessing 🎲. Swing by📍Poster #227 (Session 4) to see how your favorite model fares. Huge thanks 🙏 to @sijun_tan @SiyuanZhuang3 William Tang @ChenguangWang @ralucaadapopa & Ion Stoica!

Kyle Montgomery

@kylepmont

Last Seen Users on Sotwe

Trends for you

Most Popular Users