Announcing our COLM 2026 workshop: Scientific Understanding of Foundation Models.
We invite submissions on training dynamics, scaling laws, data and optimization, post-training, reward modeling, evaluation science, reliability, reproducibility, and theoretical understanding of foundation models.
We especially welcome rigorous empirical studies, theory-grounded work, negative results, reproductions, and papers that bridge theory and practice in service of this goal.
In person at COLM 2026, San Francisco
Submission deadline: June 23, 2026, 11:59 PM AoE
https://t.co/icxLzLZafa
Speakers: @SuryaGanguli, @JikaiJin2002, @zhiyuanli_, @waterluffy, @valentina__py, @lschmidt3, @MohammadShoeybi, @andrewgwils.
Check out William's work with @CosmicAI_Inst to help VLMs close the loop for model fitting & scientific discovery! Including a new dataset of some challenging model fitting problems in research astronomy, which we'll be digging into more!
Hypothesis -> experiments -> analysis -> conclusions. LLMs are great at writing code and conducting experiments.
But there's a weakness in their ability to propose statistical models and evaluate their fit.
Enter VESTA: Visual Exploration with Statistical Tool Agents.
(1/n) New blog from UC Berkeley, UW, and Princeton: Who scales better in long horizon: AI coding agents or top coders?
We compared modern agents to top human contestants in an open-ended coding marathon.
Agents sprinted early. Then they plateaued. Top humans kept improving.
We study this as a new test-time scaling problem: do agents learn better intrinsic test-time strategies, or are they mostly getting more random tries?
1/7
If you're at #PLDI2026 in Boulder this week, come see what our group has been up to! We're presenting work on making code generation more interactive and reliable, speeding up data pipelines, porting network data-plane programs,
Check out Ramya's work on analyzing why and how LLM-generated stories feel homogeneous: the setting you prompt with might be novel, but the plot unfolds in a very conventional way. Thread on how we quantified this and how it compares to existing metrics:
Are LLM-generated stories novel? They can have unique characters and cliché plots, or the other way around. A holistic score doesn't help distinguish the two.
Meet GENIE, a fine-grained novelty metric that tells you where and why a response is original!
Research highlight! CosmicAI Researchers Wenxuan Ding (NYU), @gregd_nlp (NYU) and external collaborator Nicholas Tomlin (NYU, TTIC) investigated whether LLM agents like Claude Code & OpenAI Codex can navigate cost-benefit tradeoffs in their actions.
https://t.co/5wDVz4iNWA
Spotting the rule from past experience is one thing; acting on it correctly is another. To find out, we introduce HERO's JOURNEY to test LLMs' inductive reasoning ability in multi-step setups. We put an LLM into a text world as a hero: it must infer the pattern from past quest trajectories, then apply it to a foe it's never seen.
We found models show signs of rule induction, but scratch the surface: sometimes they're just copying from context. Yet in multi-step execution settings, where humans naturally thrive, the cracks really start to show.
In medieval times, within the arms race of ever more demonic torture devices, some sadistic genius came up with the idea of the Little Ease.
This was a prison cell built so small in every dimension that a grown man could not stand upright in it nor lie down at full length nor properly sit.
The pain is relentless and without relief and inflicted by one's own body. Prisoners were known to go insane within a few days. A stay at the Little Ease was considered even more cruel than the rack, the thumbscrew, and the other ghoulish machinery of the Tower of London.
A breeding pig will spend her whole life in a version of that box.
These are social, roaming creatures (more intelligent than dogs) who will never leave this corset of steel.
They have been selectively bred to be bigger than their frames can support. Yet we put them in cells so confined that they cannot comfortably sit, and their attempts to do so (for example, by sneaking their limbs into adjacent stalls) reliably lead to fractures and sprains.
They cannot sweat, yet have nothing to roll around in to cool themselves off. Except their own manure, which (contrary to the common misconception) they are so averse to (thanks to their strong sense of smell) that new sows will often suffer from constipation to avoid soiling the space from which they eat and sleep.
Here is how the writer Matthew Scully described what he saw at one of Smithfield's "gestation barns":
> "Sores, tumors, ulcers, pus pockets, lesions, cysts, bruises, torn ears, swollen legs everywhere. Roaring, groaning, tail biting, fighting, and other 'Vices,' as they're called in the industry. Frenzied chewing on bars and chains, stereotypical 'vacuum' chewing on nothing at all, stereotypical rooting and nest building with imaginary straw. And 'social defeat,' lots of it, in every third or fourth stall some completely broken being you know is alive only because she blinks and stares up at you … creatures beyond the power of pity to help or indifference to make more miserable, dead to the world except as heaps of flesh into which the [insemination] rod may be stuck once more and more flesh reproduced."
The Save Our Bacon Act is trying to roll back the few state protections we have against this barbaric cruelty - for example California's Prop 12, which banned the sale of pork from pigs kept in gestation crates.
It's incredibly important we don't end up with this sort of federal preemption.
SOB will not only kill the most important animal welfare related laws in the US of the past decade, but more importantly, it will also restrict ALL future legislative progress (aka how the animal welfare movement has gotten its biggest wins).
The Senate is currently deciding whether to add the SOB Act to the Farm Bill.
With relatively little money now, we can discourage the most pivotal senators in the Ag committee from backing this amendment.
Defeating this bill is even more important given the amount of philanthropic funding I expect to come online in the next year or two.
It will plausibly be over 10x more expensive to repeal SOB than to prevent it from passing in the first place.
All that money that could be spent transforming our society's relationship to mass animal suffering will instead have to be spent just getting us back to where we are right now.
That's why money spent now fighting this bill (and I mean right NOW) is so effective.
If you're in a position to donate six figures, please DM me.
Image editing models can put you on the Moon, but can they precisely move a circle right by 50 pixels?
Introducing PaintBench: a foundational eval of visual editing operations with only one right answer.
The highest-performing model (@NanoBanana 2) reaches only 17.1%.
In the last 48h:
- Jr researcher asked me whether to use AI in making talks
- Saw two talks, with AI {slop, enhanced} slides
Collected my thoughts and wrote a post. Tl;dr: don't steal your own thinking, don't remove *you* from your talks. Also, give a &#@% about your talks.
Very excited to release DiscoverPhysics, a new benchmark and evaluation pipeline for experimentation and discovery in LLMs.
https://t.co/p3uPtQBJ7G
https://t.co/vUb0cdo6yw
Announcing First Call for Papers: Second Tokenization Workshop
Non-archival submissions of two types:
- Research papers (up to 9 pages)
- Extended abstracts (up to 2 pages)
Submission deadline June 23, 2026 (AoE)
Acceptance notification on July 24, 2026 (AoE)
New paper! LLM memory keeps improving, but this makes them *worse* as user sims. If we want to build models that can, e.g., simulate realistic students to train chatbots to be better teachers, then these models need to be able to forget like humans do
https://t.co/1GpOfwcsat
COLM 2026 will host 16(!) workshops:
https://t.co/Lf90oZTfiT
CFPs are all online, and deadlines are coming up, so check the CFP of your workshops of interest
The discussion period for COLM 2026 is underway! We're sharing a CDF of average review scores. Note that final decisions will reflect deliberation by ACs and PCs, so these are only meant to be a heuristic guideline to give you a sense of where your papers stand. Good luck!
Submit your work! The 2nd Workshop on Actionable Interpretability will be held at COLM 2026 in San Francisco!
Submission Deadline: June 21, 2026
@ActInterp