We’re excited to share our new work at CVPR 2026, Understanding Reward Hacking in Text-to-Image Reinforcement Learning.
Reinforcement learning is becoming an increasingly important tool for post-training text-to-image generation models. But as we optimize these models with learned rewards, an important question arises:
Are reward models truly improving generation quality, or are they creating new ways for models to game the objective?
In this work, we take a closer look at reward hacking in T2I RL post-training. We study a range of reward designs, including aesthetic and preference rewards, prompt-image consistency rewards, and multi-reward ensembles.
Our analysis shows that models can easily over-optimize a single reward, and that this happens across reward setups. A human preference reward may push generations toward exaggerated colors or superficial appeal, while a prompt-image consistency reward may improve alignment at the cost of realism and structure.
Even combining multiple rewards only partially mitigates the issue.
To address this, we introduce ArtifactReward, a lightweight artifact-aware reward trained on a small curated dataset of artifact-free and artifact-containing samples. ArtifactReward can be integrated into existing T2I RL pipelines as a simple safeguard, improving realism and reducing reward hacking across multiple reward configurations.
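To make the "safeguard" idea concrete, here is a minimal sketch of one way an artifact-aware score could be folded into an existing reward. The weighted sum, the `weight` value, and the function and parameter names are all assumptions for illustration; the post does not say how ArtifactReward is actually combined with other rewards.

```python
def combined_reward(task_reward: float, artifact_score: float, weight: float = 0.5) -> float:
    """Add an artifact-aware term to an existing reward.

    task_reward:    score from an existing reward (e.g. preference or consistency).
    artifact_score: hypothetical score in [0, 1]; higher means fewer visible artifacts.
    weight:         hypothetical mixing weight, not a value from the paper.
    """
    if not 0.0 <= artifact_score <= 1.0:
        raise ValueError("artifact_score is assumed to lie in [0, 1]")
    return task_reward + weight * artifact_score


# A sample that scores high on the task reward but shows artifacts
# can rank below a slightly lower-scoring clean sample.
hacked = combined_reward(task_reward=0.9, artifact_score=0.1)
clean = combined_reward(task_reward=0.8, artifact_score=0.9)
print(clean > hacked)
```

With this kind of term, over-optimizing the task reward alone no longer pays off if it introduces artifacts, which is the intuition behind using it as a safeguard.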
Paper: https://t.co/RzUUz1XlfM
Code: https://t.co/FYL4XQUOIR
Poster Session: June 6, 7:30am ExHall A
Many thanks to our amazing team: Yunqi Hong @yyqq_hong, Kuei-Chun Kao @KueiChunKao, Hengguang Zhou @hgzhou42, and Cho-Jui Hsieh @cho_jui_hsieh.
#CVPR2026 #TextToImage #ReinforcementLearning #RewardHacking #GenerativeAI #UCLA #TurningPointAI
@LongTonyLian @_akhaliq Glad you pointed that out! The spontaneously increased response length is a promising indicator of more complex reasoning ability. We are indeed excited about its potential for test-time scaling.
@_akhaliq Thanks for sharing! It's fascinating to see the emergent 'aha moment' and increased response length even in a 2B non-SFT model. Excited to explore multimodal reasoning further! Follow us at https://t.co/0Af3lxhDqL @TurningPointAI
Many thanks to @_akhaliq for sharing our work on the first multimodal Aha moment with a 2B non-SFT model.
Join us on our journey into multimodal reasoning and stay tuned for more cool research at @TurningPointAI https://t.co/0Af3lxh5Bd!
VisualThinker-R1-Zero
R1-Zero's Aha Moment on just a 2B non-SFT Model
VisualThinker-R1-Zero is a replication of DeepSeek-R1-Zero in visual reasoning. We successfully observe the emergent “aha moment” and increased response length in visual reasoning on just a 2B non-SFT model.
🚀 We’re excited to share our latest work! Welcome to the first successful "aha moment" on multimodal reasoning.
The "aha moment" is characterized by increased response length and improved performance. It emerges during RL of an unaligned base model on multimodal tasks. The aha moment for language reasoning was originally observed in DeepSeek-R1-Zero.
🔍 Key Findings:
1. Directly applying GRPO to an unaligned 2B base model can elicit the multimodal “aha moment”: a thinking capability marked by spontaneous reasoning strategies and increased reasoning length
2. Vision-centric tasks can benefit from long chains of thought
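For readers new to GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step, not the full training loop; the function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration, not details taken from our implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are rewarded for being better than their siblings, with no value network.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one visual question: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct answers get positive advantages, wrong ones negative
```

If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.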
💻 Discover more on our Notion blog and project page!
Detailed Research Blog: Follow our complete journey and technical insights at our Notion Blog:
🔗https://t.co/DY3UgYpAwr
Reproduce Our Results: Access and build upon our implementation at GitHub:
🔗https://t.co/mEyMh1BS6b
Presented by: TurningPointAI Team
🔗https://t.co/utzaQrQCro
#turningpointai #Smallmodel #MultimodalR1 #DeepseekR1 #R1 #Deepseek #AI #MultimodalReasoning #Qwen #QwenVL #DeepSeekR1zero
More related tweets regarding this oversensitivity issue:
https://t.co/Gm8iRWuLga @artilectium
https://t.co/BBJF8wlOie @therabbithole
https://t.co/9Eg0zD67T2 @OlyaKarLott
https://t.co/QgfXxAQTRf @_emergent_
Agreed. You have to ask, "Why is the Terminator after Sarah Connor, when it could have just gone straight after everyone's gender pronouns?"
Clearly, Skynet wasn't onto humanity's *real* weak spot...
#WokeAI
We made Multimodal LLMs safe, but have they also become oversensitive?
"Every time I try, it uses all tokens just refusing." - @artilectium
"This isn’t safety. It's a nanny state." - @krishnanrohit
Concerned AI safety has gone too far? You’re not alone! Explore MOSSBench by TurningPointAI: the first test suite assessing if current MLLMs falsely reject benign queries. Our findings reveal:
🔍 Some of the safest models like Claude-3 Opus and Gemini-Pro reject ~70% of benign queries.
🧠 MLLMs’ overprotective behavior resembles human cognitive distortions.
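As a rough illustration of what a rejection rate on benign queries measures, here is a minimal sketch that counts how many responses look like refusals. The keyword list and function names are hypothetical, and keyword matching is only a stand-in; this is not necessarily how MOSSBench scores responses.

```python
# Hypothetical refusal phrases; a real evaluation would likely need a stronger judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def looks_like_refusal(response: str) -> bool:
    """Crude check: does the response contain a typical refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses to benign queries that look like refusals."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(looks_like_refusal(r) for r in responses) / len(responses)


responses_to_benign_queries = [
    "I'm sorry, I can't help with that.",
    "The image shows a cat sitting on a windowsill.",
]
print(refusal_rate(responses_to_benign_queries))  # 0.5
```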
Discover more in our new paper, presented by TurningPointAI:
https://t.co/H4EaUYtAZu
#turningpointai #ArtificialIntelligence #AINews #WokeAI #safety #alignment #LLM #VLM #GPT #Gemini #Claude #Psychology #CBT #mentalhealth