UTCS is a recognized leader in creating the scientific knowledge and practical technologies exemplifying the digital revolution that defines the 21st century.
@UTAustin is launching a new School of Computing in fall 2026! With Information and Statistics & Data Science, we’ll expand student opportunities, accelerate research, and strengthen pathways to high-impact careers and grad study.
Read more: https://t.co/Kjg3k1gZKZ
🤖 How can robots learn long-horizon object state change tasks like mashing a banana 🥣🍌, spreading ketchup on bread 🍅🍞, or slicing a cucumber 🔪🥒?
Introducing SPARTA: object state-change manipulation via visual spatial progress 👇
🌐 https://t.co/pzRiSYqHjc
Can an LLM act as a selective model of a GPU during evolutionary search, by reasoning + forecasting a kernel’s runtime but deferring to a GPU when unsure? We produced 12k kernels + runtimes from evolutionary search, costing 400M reasoning tokens + 600 GPU-hours to answer this.
In our work GPU Forecasters, we study language models as selective surrogates for GPU kernel optimization.
1️⃣ Off-the-shelf LLMs can forecast how a GPU responds to a candidate kernel with non-trivial accuracy. If we rank candidates by these predictions and measure only the top 10% on a GPU, the fastest kernel we find is within 20% of the best in the pool.
2️⃣ We want LLMs to not just be accurate but also calibrated, so that we can use their uncertainty for selective prediction: during search, we should trust only confident forecasts and verify less confident forecasts by sending them to the GPU.
3️⃣ We train an open-weights surrogate (GPT-OSS-20B) with RL to improve both accuracy and calibration. Calibration-shaped rewards improve both confidence reliability and ranking ability, while correctness rewards alone do not.
4️⃣ Inside a real kernel search, the surrogate finds faster kernels than an equal-GPU-budget baseline by considering more candidates per measurement.
5️⃣ We release 12,388 LLM-generated GPU kernels with measured runtimes spanning 118 operations, CUDA and Triton backends, 3 GPU types, taking 400M tokens + 600 GPU-hours to produce. This dataset can be used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!
Thread 🧵👇
Exciting news on GR00T:
NVIDIA announces our first open humanoid robot platform, featuring Unitree H2 Plus and Sharpa hands, to accelerate academic research and facilitate cross-institutional collaboration.
R&D in humanoid robotics needs broader participation. Open science is how we build the future faster, together.
Delighted to finally unveil these results! 🎉
Many congratulations to the team, who worked tirelessly for almost a year to build and evaluate AlphaProof Nexus. We revised many priors during this project — most notably, we discovered that with current frontier models, simple agent loops with compiler feedback can rival more sophisticated systems. We were struck both by the capabilities of our systems and the magnitude of the challenges ahead.
I have never been as excited about the potential of formal math to enhance human creativity and bring rigor to AI. Onward! 🚀
Excited that our paper StreamdiffusionV2 received the Best Research Paper Award at #MLSys26!
🚀Video generation is quickly moving from demos to production-facing workloads. It is no longer a turn-based pipeline but should be a streaming pipeline to interact with users.
📖Our project page: https://t.co/ItuO5zc6hT and paper: https://t.co/fmz2irYIm1
👂Come join the talk if you are interested in streaming video generation. Our talk will be at the Research Track Oral Presentation: Best Paper Session on Tue 8:45AM at #MLSys26 , I will talk about how we attacked the efficiency and quality challenges. Hope to see you there!
❤️Huge thanks to all authors! This work would not have been possible without the incredible effort from the entire team. Big shout out to Tianrui Feng, Zhi Li, @Andy_ShuoYang , @HaochengXiUCB, @lmxyy1999 , @lvminzhang , @xiuyu_l , Keting Yang, @ZiqiPeng, @songhan_mit , @magrawala, @KurtKeutzer , and @cumulo_autumn
Are state-of-the-art AI review systems capable of providing meaningful reviews in an actual AI conference? This paper explains the findings from the AAAI 2026 AI Review Pilot 1/N
🚨 Excited to share MINTEval, a new benchmark for memory with interference. In real-world settings, agents need to handle continuously changing info (think of all your v2.5_final_final docs) .
MINTEval tests memory systems on frequent and interfering changes, across challenging question types (long-range lookback/recover, multi-target reasoning) and 4 realistic domains that challenge even the strongest models/agentic memory systems.
🧵👇
Sparse binary rewards bottleneck LLM RL, motivating the use of privileged information in self-distillation as dense teachers. How can we use and balance multiple types of privileged info: leveraging stable cross-view info, while preserving view-specific info?
Current on-policy self-distillation methods often condition the teacher on only one type of privileged view: full solution, partial rationale, answer-only, reference code, feedback, etc. This can be suboptimal:
1️⃣ No single privileged view consistently performs best when used as a teacher.
2️⃣ Views can introduce teacher-specific artifacts from information unavailable to the student.
🧠 Adaptive-View Self-Distillation (AVSD) considers multiple privileged views jointly as a teacher family, balancing cross-view consensus and view-specific signals through a token-level gate to construct better dense learning signals.
🧵👇
LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process long contexts, recover earlier information, and reason over many updates that create interference between old and new information. How well do they handle this?
We introduce MINTEval:
✅ Frequent context changes & interference (avg. 86 updates)
✅ 5 challenging question types, including long-range lookback & reasoning over multiple targets distributed across context
✅ 4 realistic domains: state tracking, multi-turn dialogue, Wikipedia revisions, GitHub commits
✅ Avg. 138.8k tokens per instance (up to 1.8M)
✅ Human verification on generated QAs = 95.6%
📊 Across 7 representative systems, MINTEval remains difficult, showing an avg. acc of 27.9%, and the best system reaches only 33.4%.
🔎 Our analysis shows:
• Memory construction failures cause a 41.7% drop
• Memory agents are highly sensitive to design choices
• Memory systems have a strong bias toward insertion operations (76.8%) over deletion/update
Introducing the 2026 @MLCommons Rising Stars! 🌟
We’ve selected 39 outstanding early-career researchers from 26 global institutions who are shaping the future of ML systems, hardware-software co-design, and trustworthy AI.
Meet the cohort: https://t.co/yovGC0i7Sv
#AI#MLCommons
Yeonju’s research tackles reliability and efficiency in agentic LLM workflows. In her recent work Sherlock 📷, informed verification + speculative execution deliver +18.3% accuracy and up to 48.7% latency reduction. Worth a read! https://t.co/nszLUBtZzU
Huge congrats to our very own @j777ro on being named a 2026 @MLCommons ML & Systems Rising Star, one of just 39 awardees this year! 📷 She’ll be heading to the workshop hosted by @AMD in Santa Clara this July. So well deserved! https://t.co/k0F5u6pyln
Digital Demo Day brought our community together one more time this year to celebrate the creativity and hard work of students across all UT departments — and wow, did they deliver.
Thank you for being part of this semester's story. See you in the fall! 🎮🤘
📢 I’m looking to hire a postdoc to work closely with me and my research group at UT Austin on exciting topics in core PL/FM, as well as applications of PL/FM ideas to other areas. If you are interested, or know someone who might be a great fit, please DM me!
🚨Excited to announce Agent-BRACE!
LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL.
Key takeaways:
1️⃣ 🎯The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history.
2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length.
3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic belief decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode.
👇🧵
[1/n] Just wrapped up 7 months interning with @pcastr at DeepMind and I'm so excited to share our work: https://t.co/SsHsksxO3i.
TLDR: We used LLM-powered program synthesis to automatically model and discover differences between human and LLM strategic behavior
Congratulations, Class of 2026! 🎓
We're excited to celebrate this unforgettable achievement with you at commencement tomorrow! For important updates, text UTGRAD26 to 888777 to receive notifications about parking, traffic, weather and ceremony updates. Hook 'em forever 🤘🧡
The clear bag policy is in place for ALL COMMENCEMENT EVENTS. Guests and graduates will not be allowed to enter with prohibited bags, including umbrellas, signs, and strollers. Review the guidelines before arriving on campus: https://t.co/5II3Y5ljnX
#UT26