Our paper, “Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior,” has been selected for Oral Presentation at CTB @icmlconf
* Paper: https://t.co/wT8OQmOs9w
* Website: https://t.co/8idwkMM76N
* Code: https://t.co/8pfdiyl71d
A central question in AI evaluation is whether we can use low-cost self-report probes to anticipate how LLMs will actually behave in tasks.
In our earlier work, “The Personality Illusion,” we found that LLMs can give coherent personality self-reports that do not reliably predict behavior. This paper asks a follow-up question: When do self-reports actually track behavior, and what are the failure modes where they don't?
Across 11 LLMs, 4 behavioral tasks, and a 2 × 2 × 2 experimental design, we find that self-report–behavior coherence exists, but it is selective:
1) The instrument matters. Broad Big Five personality traits do not predict task behavior well. But a more behavior-specific framework, the Theory of Planned Behavior, can recover much stronger coherence under favorable conditions.
2) Context matters. When self-reports and behavior happen in the same conversation, coherence can reach human-level intention–behavior baselines. But when they happen in separate conversations, coherence often collapses.
3) The task matters. Coherence survives better for behaviors anchored outside the immediate prompt, such as implicit bias and aspects of honesty. It collapses for behaviors strongly shaped by the local context, such as sycophancy.
4) Personas are not a fix. Persona prompting makes models’ self-reports more stable across conversations, but it does not reliably bring behavior into alignment. This is especially important for persona-customized AI systems: changing what a model says about itself does not necessarily change what it does.
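One way to picture "coherence" in the findings above is as a rank correlation between self-report scores and measured behavior across models. A minimal sketch, assuming Spearman correlation as the metric (the paper's actual metric may differ) and using made-up numbers purely for illustration:

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores for five models (NOT data from the paper).
self_report = [4.5, 3.0, 2.0, 5.0, 1.5]            # stated intention, 1-5 scale
behavior_same_chat = [0.80, 0.55, 0.40, 0.90, 0.30]  # behavior rate, same conversation
behavior_new_chat = [0.40, 0.70, 0.35, 0.50, 0.60]   # behavior rate, separate conversation

same_chat_coherence = spearman(self_report, behavior_same_chat)
new_chat_coherence = spearman(self_report, behavior_new_chat)
print(same_chat_coherence, new_chat_coherence)
```

With these invented numbers the same-conversation ranking matches the self-reports exactly while the separate-conversation ranking does not, which is the shape of the context effect described in point 2.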
The takeaway: LLM self-reports should not be treated as context-free behavioral diagnostics. If we want to use psychometric probes for AI safety, deployment, or model evaluation, we need task-specific instruments, behaviorally grounded validation, and careful separation between what a model says and what it actually does.
Huge thanks to my co-authors @RKocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, and R. Michael Alvarez, and to the @Caltech Linde Center for Science, Society, and Policy @CaltechLCSSP
Check out our work on end-to-end ultrasound using neural operator for lung aeration https://t.co/CV3Qnh3qCk
We directly reconstruct lung aeration maps from RF data, bypassing the need for traditional beamformers and indirect interpretation of B-mode images.
This is something I have been emphasizing since we started our work on Neural Operators.
We very quickly went from simple fluid dynamics benchmarks to hard problems like building the first high-resolution AI-weather model, FourCastNet, and modeling turbulence in nuclear fusion. For those applications, we got speedups of 10,000 to a million times.
Simple benchmarks are great for testing whether new architectures and algorithms work, but they are not the end goal.
Neural PDE solvers have seen exciting progress! 🌊
But despite growing adoption, we still don’t know 𝘄𝗵𝗲𝗻 we should use them instead of classical solvers. 🤔
Our new paper has a surprising finding: the harder the PDE task, the more cost-effective learned solvers become. 🧵👇
By capturing temporal correlations in frequency space, Fourier neural operators enable physically faithful modeling of periodically driven quantum systems and the extrapolation of dynamics beyond the training data.
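For readers new to Fourier neural operators, the core building block is a spectral convolution: transform the signal to frequency space, reweight a few low-frequency modes with learned complex weights, and transform back. A minimal sketch of that one step, assuming a 1D real signal and a naive DFT; the function name and weights are illustrative and not taken from the paper's code:

```python
import cmath
import math

def spectral_conv(signal, weights):
    """Reweight the lowest len(weights) Fourier modes of a real 1D signal.

    Modes above the cutoff are dropped. The output is real because the
    negative-frequency half is rebuilt by conjugate symmetry.
    """
    n, modes = len(signal), len(weights)
    if modes - 1 > (n - 1) // 2:
        raise ValueError("too many modes for this signal length")
    # Forward transform, truncated to the lowest `modes` frequencies.
    coeffs = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(modes)
    ]
    # In a trained FNO these per-mode weights would be learned parameters.
    mixed = [w * c for w, c in zip(weights, coeffs)]
    # Inverse transform using conjugate symmetry of a real signal's spectrum.
    output = []
    for t in range(n):
        value = mixed[0].real
        for k in range(1, modes):
            value += 2 * (mixed[k] * cmath.exp(2j * math.pi * k * t / n)).real
        output.append(value / n)
    return output

# One period of a cosine sampled at 8 points.
signal = [math.cos(2 * math.pi * t / 8) for t in range(8)]
unchanged = spectral_conv(signal, [1.0, 1.0])  # identity weights on modes 0 and 1
doubled = spectral_conv(signal, [1.0, 2.0])    # amplify the driving frequency
```

Because the weights act on whole frequency modes rather than individual time steps, a layer like this can represent periodic structure compactly, which is one intuition for why frequency-space models suit periodically driven systems.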
Read more: https://t.co/NiNphCB4fu
I am thrilled to share my article in the @americanacad Daedalus special issue on AI & Science: What Is the Future of Discovery?, edited by James Manyika. https://t.co/vvur95HXGI
I talk about: How Do We Build AI to Push the Frontiers of Scientific Discovery?
Scientific progress is limited not by a lack of new ideas but by the time and cost involved in physical experimentation. Scientific discovery is a needle-in-a-haystack problem: it does not help if AI gives you a vastly bigger haystack. Without knowing if any of the ideas work, an AI system that designs experiments just increases the effort required, since performing the experiments to validate the ideas is the real bottleneck.
In my view, AI’s most transformative impact in enabling scientific discoveries lies in reducing the need for such experiments. To get there, we need to build AI models that are able to granularly simulate and understand physics at all scales, rather than just abstractly reason in the textual domain.
I explore what methods like Neural Operators have already helped achieve, what still needs to be done, and what lies ahead.
We introduce Sparse Autoencoder Neural Operators (SAE-NOs), a functional framework for representation learning and mechanistic interpretability that treats data as samples from underlying continuous functions and learns mappings between function spaces.
Standard SAEs (SAE-MLP) represent each concept with a scalar activation and a vector-valued dictionary atom, limiting their ability to capture how and where a concept is expressed across structured domains.
SAE-FNO introduces feature-map representations with both concept sparsity and domain sparsity, allowing the model to capture not only which concepts are active, but also where and how they are expressed across the domain.
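For contrast, here is what the standard scalar-activation SAE described above looks like in code. This is a minimal sketch of a generic ReLU sparse autoencoder with an L1 penalty, not the SAE-NO or SAE-FNO architecture; all weights are toy values chosen by hand:

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative scalar codes, then decode.

    Each code is ONE scalar per concept; column i of w_dec is that
    concept's vector-valued dictionary atom.
    """
    pre_activation = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    codes = [max(0.0, z) for z in pre_activation]  # ReLU
    reconstruction = [z + b for z, b in zip(matvec(w_dec, codes), b_dec)]
    return codes, reconstruction

def sae_loss(x, codes, reconstruction, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    return mse + l1_coeff * sum(abs(c) for c in codes)

# Toy example: 2-dimensional input, 3 dictionary atoms (hand-picked weights).
x = [1.0, -2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

codes, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, codes, reconstruction, l1_coeff=0.1)
```

Note that each code carries no information about where in a structured input the concept appears. That is the limitation the feature-map representation is meant to address.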
This is a joint collaboration between @UAlberta/@AmiiThinks and @Caltech, with Ailsa Shen and @AnimaAnandkumar. 1/
arXiv: https://t.co/EphbL2FJYA
TorchLean codebase is now available!
TorchLean is a Lean 4 framework for verified neural-network software. It supports typed tensors, runnable training, graph IRs, verified autograd, Float32/IEEE semantics, CROWN / IBP-style verification, certificate checking, PyTorch interop, and CUDA/GPU execution.
After feedback and comments on our original post, we expanded TorchLean substantially: neural operators/FNOs, diffusion models, GPT-style text models, GPT-2-style runs, Mamba/state-space models, RL, 3D vision certificates, Bug Zoo case studies, PyTorch interop, and more.
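For anyone unfamiliar with the IBP-style verification mentioned above: interval bound propagation pushes a box of possible inputs through the network layer by layer to get guaranteed output bounds. A minimal sketch in plain Python, using exact real-number reasoning and toy weights; this is not TorchLean's API, and it ignores floating-point rounding, which a verified Float32 treatment would have to account for:

```python
def linear_bounds(lower, upper, weight, bias):
    """Propagate elementwise input bounds through y = W x + b.

    A positive weight pairs lower with lower and upper with upper;
    a negative weight swaps them.
    """
    out_lower, out_upper = [], []
    for row, b in zip(weight, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# Toy two-layer network with hand-picked weights.
w1, b1 = [[1.0, -1.0], [2.0, 1.0]], [0.0, -1.0]
w2, b2 = [[1.0, -1.0]], [0.0]

# Input box: x0 in [0, 1], x1 in [-1, 1].
in_lower, in_upper = [0.0, -1.0], [1.0, 1.0]

h_lower, h_upper = linear_bounds(in_lower, in_upper, w1, b1)
a_lower, a_upper = relu_bounds(h_lower, h_upper)
y_lower, y_upper = linear_bounds(a_lower, a_upper, w2, b2)
print(y_lower, y_upper)
```

The resulting bounds are sound but can be loose. CROWN-style methods tighten them with linear relaxations of the ReLU instead of plain intervals.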
Project page: https://t.co/RZjTQQSGw8
Codebase: https://t.co/NfPQVz9kdu
@Robertljg, Jennifer Cruden, Will Adkisson, Xiangru Zhong, @huan_zhang12 @caltech
#MachineLearning #ScientificComputing #Lean #FormalVerification
We’re excited to release TorchLean, the first fully verified neural network framework in Lean. The Lean community has largely focused on pure mathematics. TorchLean expands this frontier toward verified neural network software and scientific computing. With the recent release of CSlib, we see this as another step toward a fully verified ML stack.
We support the following features:
1. Executable IEEE-754 floating-point semantics (and extensible alternative FP models)
2. Verified tensor abstractions with precise shape/indexing semantics
3. Formally verified autograd system for differentiation of NN programs
4. Proof-checked certification / verification algorithms like CROWN (robustness, bounds, etc.)
5. PyTorch-inspired modeling API with eager-style development + export/lowering to a shared IR for execution and verification
Project page: https://t.co/YHpqhRbMQe
Paper: [2602.22631] TorchLean: Formalizing Neural Networks in Lean
Work done with @Robertljg, Jennifer Cruden, Xiangru Zhong, @huan_zhang12 and @AnimaAnandkumar.
#MachineLearning #ScientificComputing #Lean
Nominations are now open for the Pritzker Prize for AI in Science Research Excellence!
This prize honors outstanding researchers advancing both AI and the natural sciences or engineering.
Nominate someone today!
🔗 https://t.co/69Rf7HuHic
Accurate and scalable deep Maxwell solvers
Maxwell's equations are the bedrock of photonic device design, from metalenses to chip-scale wavelength multiplexers. Solving them over realistic device sizes (hundreds of wavelengths, with subwavelength dielectric features) is computationally brutal. Neural network surrogates have been promising on toy problems but rarely scale: fixed domain sizes, narrow parameter ranges, no general boundary conditions, accuracy that degrades as the problem grows.
Chenkai Mao and Jonathan Fan at Stanford propose a different recipe. Instead of training a network to solve the full problem, they train a neural operator on subdomains and plug it into classical iterative methods. The subdomain network is a modified Fourier neural operator that takes arbitrary Robin-type boundary conditions as inputs, used as a flexible preconditioner inside F-GMRES. It gives bounded-accuracy subdomain solutions, and reaches double precision at inference despite single-precision training.
The interesting move is at the global scale. They wrap the subdomain solver in an overlapping Schwarz domain decomposition loop, and use the same network to cheaply solve the subdomain eigenvalue problems that build a coarse space for two-level Schwarz. That coarse correction gives near-optimal scaling, where iteration counts stay roughly constant as the global problem grows.
A single network handles different sizes, resolutions, wavelengths and dielectric distributions, with 20 to 50x fewer iterations than CPU GMRES or BiCGSTAB. They benchmark up to ~3000x3000 grids and 200 wavelengths, then plug the solver into adjoint-based optimization to inverse-design freeform devices: a wavelength division multiplexer, a near-infrared metalens, and a volumetric coupler. Trajectories track ground-truth FDFD almost exactly.
For photonics, semiconductors and optical communications, this makes neural surrogates operationally useful for real device design. Training only a subdomain model and letting iterative methods handle global scaling is a reusable pattern across PDE problems in heat transfer, acoustics and mechanics.
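The "train a subdomain solver, let a classical loop handle the global problem" pattern can be sketched on a much simpler equation. The example below is an assumption-laden toy, not the paper's method: it uses 1D Poisson instead of Maxwell, Dirichlet instead of Robin interface conditions, plain alternating Schwarz instead of Schwarz-preconditioned F-GMRES, and an exact tridiagonal solve standing in for the neural operator.

```python
def exact_subdomain_solver(rhs, left_value, right_value, h):
    """Solve -u'' = f on one subdomain with Dirichlet end values.

    Thomas algorithm for the tridiagonal(-1, 2, -1) system. In the pattern
    described above, a learned operator would replace this function.
    """
    n = len(rhs)
    d = [h * h * f for f in rhs]
    d[0] += left_value
    d[-1] += right_value
    c_prime, d_prime = [0.0] * n, [0.0] * n
    c_prime[0], d_prime[0] = -0.5, d[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c_prime[i - 1]
        c_prime[i] = -1.0 / denom
        d_prime[i] = (d[i] + d_prime[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return u

def alternating_schwarz(f, n, overlap, sweeps, subdomain_solver):
    """Solve -u'' = f on [0, 1] with u(0) = u(1) = 0 using two overlapping subdomains."""
    h = 1.0 / (n + 1)
    u = [0.0] * (n + 2)                 # nodes 0..n+1, boundary nodes stay zero
    rhs = [f(i * h) for i in range(n + 2)]
    mid = (n + 1) // 2
    hi1 = mid + overlap                 # subdomain 1 owns interior nodes 1..hi1
    lo2 = mid - overlap                 # subdomain 2 owns interior nodes lo2..n
    for _ in range(sweeps):
        # Each solve takes its interface value from the current global iterate.
        u[1:hi1 + 1] = subdomain_solver(rhs[1:hi1 + 1], u[0], u[hi1 + 1], h)
        u[lo2:n + 1] = subdomain_solver(rhs[lo2:n + 1], u[lo2 - 1], u[n + 1], h)
    return u

n = 19
solution = alternating_schwarz(lambda x: 1.0, n, overlap=3, sweeps=20,
                               subdomain_solver=exact_subdomain_solver)
grid = [i / (n + 1) for i in range(n + 2)]
max_error = max(abs(u - x * (1 - x) / 2) for u, x in zip(solution, grid))
```

Only `subdomain_solver` ever sees a fixed-size local problem, so the same solver can be reused as the global grid grows. Without a coarse space, though, the number of sweeps needed rises as more subdomains are added, which is the gap the two-level method described above is designed to close.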
Paper: Mao & Fan, Proc. of the National Academy of Sciences (2026) | journal license https://t.co/WxTJwVPUcn
Excited to share that The Personality Illusion has been accepted to ICML 2026 🥂
We show that LLMs' self-reported personalities are systematically dissociated from their actual behavior :)
Huge thanks to my amazing collaborators and advisors!
@RKocielnik @p_song1 Ramit Debnath, Dean Mobbs @AnimaAnandkumar @rmichaelalvarez
#ICML #Caltech #LLM
Geometric operator learning is challenging because high-quality simulations on complex geometries are expensive. In GeoPT, we pretrain on low-cost graphics datasets augmented with simple dynamics, showing promising scaling behavior.
⚛️ Explore how AI physics can accelerate clean, modular nuclear reactor design.
By leveraging NVIDIA CUDA-X libraries, PhysicsNeMo, and Omniverse libraries, see how nuclear developers address these challenges with GPU-accelerated digital twins. https://t.co/nC0ou965mW
Physics-Informed Neural Operators: Learning The Solver, Not Just One Solution
Our PINN scene learned one solution field for one PDE setup. A Physics-Informed Neural Operator learns the map from input fields, like material coefficients or source terms, to the full solution across a whole family of PDE problems.
So, the goal is no longer just one approximate answer, but a reusable solver-shaped object guided by the physics itself.
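To make "guided by the physics itself" concrete, here is a minimal sketch of a physics-informed operator loss, assuming the toy PDE -u'' = f on [0, 1] with zero boundary values and a finite-difference residual. The two "operators" are hypothetical stand-ins rather than trained networks:

```python
def pde_residual_loss(u, f, h):
    """Mean squared residual of -u'' = f at interior nodes (central differences)."""
    residuals = [
        -(u[i - 1] - 2 * u[i] + u[i + 1]) / (h * h) - f[i]
        for i in range(1, len(u) - 1)
    ]
    return sum(r * r for r in residuals) / len(residuals)

def operator_physics_loss(operator, sources, h):
    """Average the PDE residual over a whole FAMILY of source terms.

    A PINN would minimize this for one source; an operator is scored on many.
    """
    return sum(pde_residual_loss(operator(f), f, h) for f in sources) / len(sources)

n_nodes = 11
h = 1.0 / (n_nodes - 1)
grid = [i * h for i in range(n_nodes)]

# A family of problems: constant sources f(x) = c for several values of c.
sources = [[c] * n_nodes for c in (1.0, 2.0, 3.0)]

def closed_form_operator(f):
    """Hypothetical 'perfect' operator: exact solution c * x * (1 - x) / 2."""
    c = f[0]
    return [c * x * (1 - x) / 2 for x in grid]

def zero_operator(f):
    """Hypothetical untrained operator that always predicts u = 0."""
    return [0.0] * len(f)

good_loss = operator_physics_loss(closed_form_operator, sources, h)
bad_loss = operator_physics_loss(zero_operator, sources, h)
print(good_loss, bad_loss)
```

The loss needs no precomputed solutions: the PDE itself scores each prediction, and training would adjust the operator's parameters to drive this residual down across the family, typically alongside a data term when reference solutions exist.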
🔬 Weekly Science Long Read 🌍
🤖 @Caltech with @AnimaAnandkumar, new @ScienceBoard_UN member: AI can model weather, climate, food, and disease. For more, read the Board's Brief on Verification of Frontier AI.
🌐 Article: https://t.co/jYr8ryIAjO
📘 Brief: https://t.co/Fsl16g5e1Z
✨ Introducing the members of @ScienceBoard_UN!
🌍 @AnimaAnandkumar is the Bren Professor at @Caltech, previously at @nvidia and @awscloud. She is a leading voice on artificial intelligence, machine learning, and AI-for-science.
📣 Read more: https://t.co/eSInS0aCqM