A 10 million parameter model just outperformed deterministic rivals 3 times its size by doing something regular recursive AI dont do: exploring multiple reasoning paths at the same time.
Most AI reasoning models are trapped on a single train of thought, and GRAM ("Generative Recursive Reasoning") is the first to break that by letting the model think in parallel universes simultaneously.
The problem is that all existing recursive models are fully deterministic, meaning given the same input they always follow the exact same reasoning path and can never escape a wrong trajectory or discover more than 1 valid answer.
GRAM fixes this by injecting learned randomness at each refinement step, so the model samples a slightly different direction each time rather than snapping to 1 fixed next state, which produces a spread of diverse reasoning trajectories.
At test time the model runs many of these paths in parallel and selects the best one using a small reward predictor trained alongside the main model, adding a "width" scaling axis on top of the usual "depth" axis of running more recursion steps.
On hard Sudoku puzzles, GRAM with 10M parameters hits 97% accuracy versus 87.4% for the best prior recursive model, and with only 20 parallel samples it outperforms every deterministic baseline even at 320 recursion steps.
On tasks with many valid answers like N-Queens, deterministic recursive models collapse as the number of solutions grows, while GRAM maintains near-perfect accuracy throughout.
The same stochastic framework also acts as a generator: given a blank board, GRAM produces valid Sudoku puzzles 99% of the time using 16 steps, versus 1,000 steps and 55M parameters for the best diffusion baseline at just 91%.
---
Paper Link – arxiv. org/abs/2605.19376v1
Lots of @stanfordnlp work at @icmlconf. See you in Seoul! 🇰🇷
Contextualized Privacy Defense for LLM Agents
Yule Wen, @StevenyzZhang, …, @Diyi_Yang
You gave your AI agent access to your email—it’s much more useful then. But how to maintain your privacy?
https://t.co/4ZGY2idfq8
A 30B model just hit gold-medal scores at the world's hardest math contest.
Olympiad math and physics are the hardest reasoning tests around.
Most gold-medal scores come from massive specialized systems built for one subject.
Shanghai AI Lab just released SU-01, a 30B open reasoning model.
It clears the medalist cutoff on the latest IMO, USAMO, and IPhO.
The recipe is the interesting part.
They turn a general backbone into a proof solver in three stages:
1. Curriculum fine-tuning on 338K traces
2. 200 steps of two-stage reinforcement
3. Generate, verify, and revise at inference
No code execution, no symbolic solvers, no external tools involved.
The model sustains over 100K tokens of natural-language proof per run.
Quality came from the training and inference loop, not from scaling parameters.
The weights and recipe are public.
What does this unlock for smaller open-source labs working on scientific reasoning?
Prediction markets don't lie - they're systematically biased
and that bias has geometry
we built a calibration surface C(K, τ) - a 2D error map across strike K and horizon τ
turns out: markets overprice extremes. underprice short horizons. drift optimistic on long ones
this isn't noise - it's structure
C(K, τ) = C_K(K) + C_τ(τ) + C_int
smile * temporal drift * interaction
three layers of systematic error you can measure, decompose, and correct
when MCI(τ) < 0.80 - the market loses price discovery
it's not about whether markets "believe" something - it's about whether they know what they don't know
Google DeepMind just used AlphaEvolve to breed entirely new game-theory algorithms that outperform ones humans spent years designing
the discovered algorithms use mechanisms so non-intuitive that no human researcher would have tried them.
here's what actually happened and why it matters:
Congrats on the launch @simile_ai ! (and I am excited to be involved as a small angel.)
Simile is working on a really interesting, imo under-explored dimension of LLMs. Usually, the LLMs you talk to have a single, specific, crafted personality. But in principle, the native, primordial form of a pretrained LLM is that it is a simulation engine trained over the text of a highly diverse population of people on the internet. Why not lean into that statistical power: Why simulate one "person" when you could try to simulate a population? How do you build such a simulator? How do you manage its entropy? How faithful is it? How can it be useful? What emergent properties might arise of similes in loops?
Imo these are very interesting, promising and under-explored topics and the team here is great. All the best!
Microsoft and Georgia Tech gave existing models the ability to decide how to think.
The model brainstorms in latent space and only writes its thoughts when confident.
It makes them up to 6.78x more efficient.
Are we entering the age of latent reasoning?
Here's how it works:
- Monitor Confidence: The system, SwiReasoning, tracks the LLM's predictive entropy. High entropy means low confidence in the next step.
- Think Silently: When confidence is low, the AI switches to latent reasoning, exploring concepts as continuous soft embeddings instead of generating tokens.
- Write Aloud: Once confidence rises, it switches back to explicit Chain-of-Thought, generating tokens to lock in its logical path.
- Prevent Overthinking: A switch counter caps the number of cycles, forcing a conclusion to maximize token efficiency.
Result: A peak token efficiency gain of 6.78x over standard CoT and an average 56-79% efficiency gain on models like Qwen3-8B.
Most web agents still click around blindly because they never store real knowledge about page parts or user goals.
This work builds Web‑CogReasoner, an agent that learns in 3 clear rounds, memorize facts, grasp concepts, then practice procedures, and thinks through that stack before it moves.
The team first scraped 14 popular sites and shaped 12 tasks, giving 81K fact samples, 62K concept samples, and 27K procedure samples.
These pieces land in Web‑CogDataset, a curriculum that grows from naming a button to finishing a noisy multi step booking.
Training starts on Qwen2.5‑VL‑7B, adds factual labels, then concept summaries, then full action trails, each stage widening the context window to 8K tokens.
During inference the model writes a little diary that lists what it sees, what it means, and which click comes next, so every move is traceable.
A new exam called Web‑CogBench checks memory, understanding, and exploration.
Web‑CogReasoner scores 84% overall, beating Gemini 2.5 Pro by 4 points and the strongest open source rival by 12 points.
On live WebVoyager tasks it finishes 30% of jobs, up from 26% for OpenWebVoyager‑Max, and halves the gap with closed models.
An ablation study proves the Bloom style ladder matters, since adding facts lifts memory by 17 points, concepts lift understanding by 11 points, and procedures lift exploration by 7 points.
Overall this structured crash course in what, why, and how turns vague guesswork into reliable web navigation.
----
Paper – arxiv. org/abs/2508.01858
Paper Title: "Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents"
@paulosalem Human reasoning is not based on auto-regressive discrete symbol (token) prediction.
It is based on the manipulation of mental models in continuous representations spaces.
It is based on *searching* for a set of manipulations of this model to arrive at a particular result.
SEAL: LLM That Writes Its Own Updates Solves 72.5% of ARC-AGI Tasks—Up from 0%
This is a breakthrough that is rarely seen and could open up undreamt-of possibilities. In the following, I will go into more detail and summarize this breakthrough:
This paper analyzes advanced reasoning models' performance by examining their internal steps as reasoning graphs.
Methods 🔧:
→ Extract reasoning graphs by clustering hidden state representations for each reasoning step.
→ Measure graph properties like cyclicity, diameter, and small-world index.
→ Compare these properties between reasoning models and base models on mathematical reasoning tasks.
→ Analyze correlations between graph properties, task difficulty, model size, and supervised fine-tuning performance.
📌 Increased graph diameter reveals wider state exploration drives reasoning performance.
📌 Higher small-world index shows efficient local clustering and global reach in reasoning steps.
📌 Graph properties offer objective metrics for evaluating supervised fine-tuning dataset quality.
----------------------------
Paper - arxiv. org/abs/2506.05744v1
Paper Title: "Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties"
holy shit 😳
some scientists literally predicted the next Pope with math!
these researchers laid out Vatican’s hidden network:
- the connections
- influences
- and indirect relationships
they actually predicted Robert Prevost as a leading candidate before his election.
Super nice paper from Bytedance.
With this, AI learns to code better by first learning how to pick its own high-quality training data.
Releases Seed-Coder: Let the Code Model Curate Data for Itself
Current LLMs for code pretraining heavily rely on manual data curation, which is unscalable and biased.
Seed-Coder introduces LLMs to automate this data curation. Its model-centric pipeline scores and filters code, achieving 36.2% on MHPP with minimal human input.
📌 LLMs curating data for other LLMs cuts human bias, boosting scalability.
📌 Model-driven pipelines evaluate code quality more subtly than fixed rules.
📌 Tailored instruct and reasoning models excel at specific coding tasks.
Methods Explored in this Paper 🔧:
→ A model-centric data pipeline leverages LLMs for scoring and filtering diverse code data sources.
→ The instruct model undergoes supervised fine-tuning and direct preference optimization for better instruction following.
→ The reasoning model employs Long-Chain-of-Thought reinforcement learning to improve multi-step code reasoning.
Existing Graph RAG (GraphRAG) methods struggle because they represent knowledge using only binary relations (linking two entities), missing complex real-world connections involving more than two entities.
This paper introduces HyperGraphRAG, which uses hypergraphs to model these multi-entity (n-ary) relationships directly using hyperedges, improving knowledge representation for LLMs.
HyperGraphRAG shows better accuracy, achieving higher Context Recall (e.g., 60.34 overall) and Answer Relevance (e.g., 85.15 in Medicine) than previous methods.
📌 Hypergraphs intrinsically model multi-entity facts, overcoming the information loss in binary graph representations.
📌 Dual vector retrieval (entities, hyperedges) enables precise fact finding and contextual expansion simultaneously.
📌 Capturing richer relations via hypergraphs improves accuracy, balancing slightly increased construction time and cost.
----------
Methods Explored in this Paper 🔧:
→ LLMs extract n-ary relational facts (hyperedges connecting multiple entities) from text to construct a knowledge hypergraph.
→ A bipartite graph structure stores the hypergraph efficiently in standard graph databases.
→ Vector embeddings represent both entities and hyperedges for semantic retrieval using similarity search.
→ A retrieval strategy first finds relevant entities based on the query, then expands to find connected hyperedges and related entities.
→ Generation combines retrieved hypergraph facts with traditional chunk-based retrieved text for a comprehensive final answer.
----------------------------
Paper - arxiv. org/abs/2503.21322
Paper Title: "HyperGraphRAG: RAG with Hypergraph-Structured Knowledge Representation"
How LLMs acquire factual knowledge during training remains unclear.
This paper investigates these learning dynamics using synthetic biographies, revealing a three-phase process where models first learn statistics, plateau while forming attention circuits, and finally acquire specific facts.
The study proposes data scheduling, like initially using imbalanced data, can accelerate learning by shortening the plateau.
📌 Plateau phase builds essential recall circuits; data scheduling optimizes this pre-knowledge acquisition stage.
📌 Tailoring data distribution (imbalanced/uniform) to distinct learning phases strategically accelerates overall knowledge acquisition.
📌 Fine-tuning struggles because hallucinations emerge with knowledge, rapidly corrupting feed-forward associative memories.
----------
Methods Explored in this Paper 🔧:
→ Attention patching experiments, swapping attention patterns between models, demonstrate that recall circuits develop specifically during the performance plateau phase.
→ Fine-tuning struggles to add new knowledge because hallucinations emerge concurrently with learning, quickly degrading existing memories stored in feed-forward layers.
→ Imbalanced data distributions shorten the plateau phase, while uniform distributions are optimal for the later knowledge acquisition speed, presenting a trade-off.
----------------------------
Paper - arxiv. org/abs/2503.21676
Paper Title: "How do language models learn facts? Dynamics, curricula and hallucinations"
Brain signals and LLM embeddings converge for predicting every spoken or heard word.
Beautiful research from @GoogleAI
They compared human brain activity during real conversations with internal embeddings from a speech-to-text LLM.
Measured electrode signals in speech and language-related brain regions and matched them to the model’s word-level features.
🤖Key Highlights
→ Brain activity aligns linearly with LLM embeddings for real-life spoken conversations.
→ Sequence of comprehension: first speech sounds, then word meaning.
→ Sequence of production: planned meaning, then articulation, then hearing one’s own voice.
→ Consistent predictive coding (pre-onset anticipation, post-onset surprise) mirrors LLM next-word prediction.
→ Lower-tier auditory regions still show partial sensitivity to semantic information.
🤖 Model-Brain Alignment
They observed a clear sequence: during comprehension, auditory cortex (superior temporal gyrus) showed strong correlation with speech embeddings, then language embeddings aligned with Broca’s area.
During production, Broca’s area correlated with language embeddings before articulation, followed by motor cortex signals matching speech embeddings. This suggests that next-word prediction and higher-level meaning representation in the model parallel the brain’s approach.
⚙ So the study revealed a shared computational principle of predicting words in context.
Even though the Transformer-based LLM processes words in parallel layers, the human brain processes them serially yet mirrors similar statistical regularities.
This supports a “soft hierarchy” where both lower-level acoustic processing and higher-level semantic processing partially overlap in the brain.
The Big Ideas (trivial or not, depends on you) are: 1. orthogonalize your alphas; 2. model alpha uncertainty. The first one gives you factor neutrality and separability of the optimization problem (=>closed-form solutions). The second one gives you more effective sizing.
5/