🚨New paper🚨 Self-Execution Simulation Improves Coding LLMs
Current reasoning LLMs <think> before giving a code solution.
We post-train them to explicitly simulate test execution, in order to verify and fix their own code leading to additional gains!
Details 👇🏽🧵
PhD Submitted🎉🎓
What a ride🎢 who'd have thought over a decade ago coding my (crappy) ASR system😅
So grateful to my amazing supervisor @adiyossLC for always focusing on good science, and letting me make mistakes. And of course my collaborators, lab mates, family and wife❤️
@DimitrisPapail Very cool! You might also find our recent work interesting :) We explicitly post-train models to simulate the execution of their own code, and use that for self-verification and iterative self-refinement 👇
https://t.co/kK9SgmnEGF
🚨New paper🚨 Self-Execution Simulation Improves Coding LLMs
Current reasoning LLMs <think> before giving a code solution.
We post-train them to explicitly simulate test execution, in order to verify and fix their own code leading to additional gains!
Details 👇🏽🧵
@LChoshen Sent a DM :)
I also think it is interesting to understand if your experiment represents a local minima (I.e. sharing a representation would have lower loss), or if under the setting (data, model size) having separate representations would lead to lower train/val loss
@LChoshen We saw cases of knowledge transfer (answering in speech facts only seen in text) when looking at interleaved speech-text LMs like the ones from our paper - https://t.co/tgzJ3X4hhC
We are working now on understanding how this works, what is needed, how to boost:)
Have LLMs become supervised learners (once again)?!
In our new paper, we argue that current LLMs’ post-training methods have effectively reverted to the "pre-train then fine-tune" era, explicitly tailoring models to desired behaviors.
1/n
[1/5] Is Text Enough for Control? 🐇
Text-driven video editing lets you describe *what* to change. But what about *how much*?
We introduce Adaptive-Origin Guidance (AdaOr).
A joint work with @DecartAI and @TelAvivUni 🧪
accepted to #SIGGRAPH2026.
StressTest is heading to #ACL2026! ✌️
Updates include:
☕️ StressPresso, new benchmark for stress tasks generalization.
🧠 More models and more complex reasoning tasks!
Thanks to the GREAT collaborators @GallilMaimon@adiyossLC !
Project and paper: 🔗 https://t.co/pA7FAtaHOJ
Coding models still have a weird blind spot: they can write code, but often can't faithfully predict what that code will do.
"Self-Execution Simulation Improves Coding Models" attacks that gap by training models to mentally execute programs, then using that skill to rank and repair solutions.
@MiniMax_AI Really interesting! “training coding agents relies heavily on execution... which brings massive computational overhead... We are exploring building a World Model capable of predicting code execution results”
Our new work is highly related, check it out👇
https://t.co/kK9SgmnEGF
🚨New paper🚨 Self-Execution Simulation Improves Coding LLMs
Current reasoning LLMs <think> before giving a code solution.
We post-train them to explicitly simulate test execution, in order to verify and fix their own code leading to additional gains!
Details 👇🏽🧵
🚨New paper🚨 Self-Execution Simulation Improves Coding LLMs
Current reasoning LLMs <think> before giving a code solution.
We post-train them to explicitly simulate test execution, in order to verify and fix their own code leading to additional gains!
Details 👇🏽🧵
@arxiv upload "on hold" for over a week and not for the first time🥲 Guess the world isn't ready for the paper😅
No reason provided, reach out responded with generic "wait". Would be useful to get reasons so that next time we can avoid overloading the moderators
🚨New paper🚨 Self-Execution Simulation Improves Coding LLMs
Current reasoning LLMs <think> before giving a code solution.
We post-train them to explicitly simulate test execution, in order to verify and fix their own code leading to additional gains!
Details 👇🏽🧵
🎓💯 Inspired by RL with execution feedback, we also train the model in multi-turn RL to fix/submit its own code given simulated execution feedback.
Turns out you can get boosts relative to strong reasoning models even without tools and real execution!