Paper: https://t.co/KgiEGp47oY
Code: https://t.co/vZSj41AsVv
Model: https://t.co/hi3MelOINY
@DavidDinucu @ndaheim_ @idohakimi @IGurevych @mrinmayasachan
Learn now with our open-source tutor directly on your laptop: ollama run https://t.co/lXJnkxLbMa
Thinking assistants instead of homework solvers. Most LLMs are helpful at the turn level but do not plan for long-term student learning. How can we make LLMs more collaborative and better at tutoring?
#EMNLP2025
Altogether, this allows us to train smaller LLMs for tutoring that match or surpass the performance of larger specialized tutoring models while navigating a trade-off between leaking and student solve rate.
Together with ETH Zürich and the CSCS, we have just released Apertus, 🇨🇭 Switzerland's first large-scale, open, multilingual language model: a milestone in generative AI for transparency and diversity.
Find out more: https://t.co/mDsDg3Dj5e
If you're at ACL, join us for the tutorial "LLMs for Education: Understanding the Needs of Stakeholders, Current Capabilities and the Path Forward" at the BEA workshop (Room 1.85–86), 9:00 am-12:30 pm tomorrow (July 31st) @aclmeeting
AI alignment for tutoring: we use full online RL with conversation-level rewards, not just single-turn signals as in DPO. Did the student actually learn by the end?
Using GRPO, the model learns real teaching strategies like when to hint or when to correct.
Explore models below ⤵️
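For readers new to GRPO, here is a minimal sketch of its group-relative advantage step applied to conversation-level rewards. It assumes several simulated conversations are sampled for the same problem and each receives one scalar reward; the function name, the `eps` constant, and the example rewards are illustrative and not taken from the released code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same problem.

    Each reward scores a whole tutoring conversation, not a single turn.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Conversations better than the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four simulated conversations on one problem.
advantages = group_relative_advantages([0.0, 1.0, 1.0, 0.0])
```

Every turn of a conversation would share that conversation's advantage, which is how an end-of-dialog signal can shape earlier tutoring moves such as hinting.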
This paper introduces an online reinforcement learning framework using simulated student-tutor interactions.
It trains LLMs to prioritize guiding students pedagogically instead of simply revealing solutions, aligning models with better teaching methods.
This helps students learn how to solve problems independently.
Methods:
- The online reinforcement learning method trains the tutor model directly on conversations simulated with a separate student LLM.
- A custom reward function scores full conversations based on two objectives: increasing the student's success rate after the dialog and ensuring the tutor follows good pedagogical principles.
- This reward system penalizes the tutor for leaking solutions, promoting guided problem-solving.
- The framework uses LLM judges to evaluate pedagogical quality.
- Controllable reward weighting balances these objectives, enabling navigation of the trade-off between student solving gains and pedagogical support.
- Thinking tags are included to enhance the tutor model's interpretability and instructional planning.
Online reinforcement learning trains directly on interactive teaching rollouts, avoiding the limitations of static data.
The reward weight lambda explicitly controls the trade-off between pedagogy and student success.
Preserved scores on reasoning benchmarks indicate that RL transfers better than supervised fine-tuning baselines.
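To make the lambda trade-off concrete, here is a rough sketch of a conversation-level reward. The exact formula in the paper may differ: the additive penalty form, the field names, and the rule that all judges must accept are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConversationOutcome:
    solve_rate_after: float    # share of correct student attempts after the dialog
    judge_accepts: list[bool]  # one verdict per LLM judge on pedagogical quality
    leaked_solution: bool      # True if the tutor revealed the answer


def conversation_reward(outcome: ConversationOutcome, lam: float) -> float:
    """Assumed form: student success minus a lambda-weighted pedagogy penalty."""
    pedagogy_ok = all(outcome.judge_accepts) and not outcome.leaked_solution
    penalty = 0.0 if pedagogy_ok else 1.0
    return outcome.solve_rate_after - lam * penalty


# Hypothetical outcomes: a guided dialog versus one that leaks the answer.
guided = ConversationOutcome(0.6, [True, True], leaked_solution=False)
leaky = ConversationOutcome(1.0, [True, False], leaked_solution=True)
```

With `lam = 0.0` the leaky dialog scores higher (1.0 vs 0.6), while with `lam = 1.0` the guided dialog wins (0.6 vs 0.0), which is the kind of trade-off the weighting is meant to steer.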
----------------------------
Paper: arxiv.org/abs/2505.15607
Paper Title: "From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning"
How well can LLMs teach?
Evaluating LLMs for education is key to making real progress, yet we lack a reliable and simple benchmark. Introducing MathTutorBench, an open-source benchmark designed to assess holistic tutoring capabilities in AI.
More knowledge, better teaching?
Subject expertise does not always correlate with effective teaching; instead, pedagogy and subject knowledge may present a trade-off.
How do we measure teaching quality?
We train a reward model that scores open-ended teacher responses and accurately distinguishes expert-level from novice teaching.