Just caught up with the recent GLM-5.2 release. The best open-weight model today.
Architecture-wise, it's built on the GLM-5 and GLM-5.1 architecture that I covered previously, which means it's reusing the Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA) mechanisms from DeepSeek V3.2. (I wrote about it here: https://t.co/tuunazfQ8y)
What's new is that they added an IndexShare mechanism. (That's a cross-layer reuse trick for DSA where instead of recomputing the sparse-attention top-k indexer in every layer, GLM-5.2 runs the full indexer only once every four layers and lets the following layers reuse those selected token indices. This keeps the same DSA idea but makes 1M-token inference much cheaper.)
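To make the IndexShare idea concrete, here's a minimal sketch in plain Python. It is not GLM-5.2's code: `toy_indexer`, `topk_indices`, and `select_indices_per_layer` are hypothetical names, the scores are pseudo-random stand-ins for a learned indexer, and which layer in each group of four runs the indexer (here, the first one) is my assumption. It only shows the bookkeeping: compute the top-k indices once, then reuse them for the next three layers.

```python
import random


def topk_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def toy_indexer(layer, num_tokens, seed=0):
    """Hypothetical stand-in for a learned indexer: pseudo-random relevance scores."""
    rng = random.Random(seed * 1000 + layer)
    return [rng.random() for _ in range(num_tokens)]


def select_indices_per_layer(num_layers, num_tokens, k, share_every=4):
    """Run the indexer on every `share_every`-th layer; the layers in between
    reuse the most recently selected token indices (assumed sharing pattern)."""
    selected = []
    indexer_calls = 0
    cached = None
    for layer in range(num_layers):
        if layer % share_every == 0:
            cached = topk_indices(toy_indexer(layer, num_tokens), k)
            indexer_calls += 1
        selected.append(cached)
    return selected, indexer_calls


selected, calls = select_indices_per_layer(num_layers=8, num_tokens=32, k=4)
print(calls)  # 2 indexer runs instead of 8
```

With `share_every=4`, the indexer cost drops to roughly a quarter, while every layer still attends over a sparse top-k set.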
Wow.
@Zai_org GLM 5.2 is a marvel! It is *at least* as good as Opus 4.8 and GPT 5.5. It's super fast, inexpensive, and not too verbose.
It responds with nuance and judgement, & handles long context VERY well.
I've never experienced an open weights model like this before.
@FastCompany just published a great piece on @theworldlabs , @drfeifei , Marble, and the idea that spatial intelligence / world models may be one of the next big shifts in AI.
I was happy to be quoted in the article, but I also wanted to share more context about my own experience with World Labs and Marble, and why this direction is especially interesting to me.
https://t.co/mdWBmSuNBe
My starting point: volumetric capture
—
For the past few years I’ve been exploring and using volumetric capture and reconstruction (photogrammetry, NeRFs, 3D Gaussian Splats), mostly capturing locations around Montreal: alleys, museums, urban interiors.
I love every step of it: the capture itself, the pipeline, and what can be done with the output. Turning real spaces into real-time explorable systems.
I do this personally, sharing explorations here, and professionally as chief technologist and co-founder of Dpt.
Physical reality + generative manipulation
—
In my work I’m especially drawn to mixing physical reality with generative and digital manipulation: using physical interfaces (light, clay, ink, ... ) to drive generative AI pipelines, building mixed reality prototypes that reshape your surroundings, or starting from real captured spaces and transforming them using tools like Marble.
Like many people, I saw the World Labs announcement on Twitter in September 2024, and Marble when it surfaced in early December. But by then, I already had a sense something was coming.
The first conversation
—
As someone deep into volumetric capture and radiance fields, I obviously knew about @BenMildenhall and his pioneering work on NeRF. To my surprise, Ben reached out to me in late June 2024. He’d been following some of my experiments and wanted to chat about my process and workflows and how I was using this “stuff” creatively.
At that point he didn’t share what he was building, but we had a genuinely great conversation about radiance fields, AI, and my work. He was curious about the creative perspective, not just the technical one.
When the World Labs announcement dropped a few months later, it all made sense. I understood what Ben had been working on, and why the creative angle mattered to them. Then in August 2025, he invited me to try the Marble beta, and I’ve been experimenting with it since.
Experimenting with Marble
—
The first thing I used Marble for was materializing scene and world concepts during ideation at the studio, and seeing if and how it could fit into our production pipeline. In parallel, I dove into a series of experiments focused on world manipulation: starting from real captured spaces and transforming them using Marble.
I’d already been exploring that idea using img2img diffusion with ControlNet on NeRF renders, real-time video streams, and even mixed reality using headset camera feeds. But Marble brings something different. It generates persistent, spatially cohesive 3D worlds that can be rendered in real time across a wide range of devices.
That’s a real shift.
Experiment 01: Parallel Realities
—
The first experiment, Parallel Realities, starts from a volumetric capture of a real location, reconstructed as 3D Gaussian Splats. Using Marble, I generate an alternate version of that same space, something informed by the original architecture: abandoned, nature-reclaimed, alternate era.
Then, using Spark (World Labs’ 3D Gaussian Splatting renderer for THREE.js), I make both realities coexist in the same spatial coordinate system. From there, I use a portal UX mechanic to let the user step between the real reconstruction and the Marble-generated version.
Experiment 02: Hidden Depth
—
The second experiment, Hidden Depth, does not transform a space as much as expand it.
A captured location has a visual boundary (a mural, a doorway, a dark corridor) and Marble generates what exists beyond it. For example: a Montreal alley has a painted mural; step through it and you’re inside a world informed by what is actually depicted there.
World Labs showcased part of this work here:
https://t.co/0RQTDWsgs2
And in their Spark 2.0 post:
https://t.co/X34yzkLBOm
The project page is here:
https://t.co/T6Qxuuq9RJ
Why this matters to me
—
Being able to start from a real 3D Gaussian Splat scene and manipulate it with Marble opens up a lot of ideas. The 3DGS pipeline is becoming an increasingly compelling foundation for exploration, experimentation, and storytelling.
What matters most to me right now is more control. The more I can steer the generated scene or world, the more useful the tool becomes. I want more features like the existing multi-image input and Chisel, the blockout-based approach.
I would like better local control, the ability to expand a generated world more and more while preserving coherence, and the ability to directly import 3D Gaussian Splat scenes to be used as a starting point. I want more ways to shape the result, not just a “prompt and hope” approach.
—
It is exciting to see this field moving from research and demos toward actual creative workflows.
Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling
But representation learning has relied on strong assumptions like augmentations, masking, cropping, etc... until now!
🎬 Introducing Temporal Difference in Vision (TDV), a new paradigm for representation learning built on a single assumption: causality
TL;DR:
- We introduce TDV, the first approach to learn good representations without any augmentations, masking, cropping, or pixel-based reconstruction
- TDV matches SOTA recipes like DINO and iBOT on dense spatial tasks
- We show that as data scales, weaker assumptions work better
🧵Thread:
The perfect LLM has over 10.5 quadrillion params
This number is based on actual research, but how?
Almost a year ago, a paper called "Pre-training under infinite compute" came out from Stanford's most famous LLM researchers [1/6]
LLMs tell me that for each 1 token of text there are 1B tokens of vision (video frames)
So if vision follows similar scaling laws (it doesn't) and 1 vision token is worth 1 text token (it isn't), the perfect vision encoder has 10 billion quadrillion parameters
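Here's the back-of-the-envelope arithmetic behind that number, as a quick sketch. The one assumption I'm adding is that the parameter count scales linearly with the token count, which is what multiplying the two figures implies; the variable names are just for illustration.

```python
# Figures quoted in the posts above.
text_params = 10_500_000_000_000_000       # 10.5 quadrillion = 10.5 * 10**15
vision_tokens_per_text_token = 10**9       # "1B tokens of vision" per text token

# Assumption: parameters scale linearly with tokens, so just multiply.
vision_params = text_params * vision_tokens_per_text_token

billion_quadrillion = 10**9 * 10**15       # 10**24
print(vision_params / billion_quadrillion)  # 10.5
```

So "10 billion quadrillion" is the rounded version of about 1.05 × 10^25 parameters.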
Beautiful paper from Google DeepMind.
Explains the pathways from AGI to ASI, and why that jump could happen through several routes.
The authors frame the AGI-to-ASI transition around 4 technical pathways:
- continued scaling of compute, model size, data, and test-time inference;
- algorithmic paradigm shifts beyond today’s transformer-based foundation-model stack;
- recursive self-improvement, where AI accelerates AI R&D and improves future systems; and
- multi-agent collective intelligence, where large populations of specialized agents coordinate into a superhuman group agent.
Scaling may work for a while, but it could hit limits in data, compute, energy, or weaker returns from making systems larger.
Recursive improvement is the most uncertain path, because AI could speed up AI research, but that loop may also slow if hard research problems need real-world testing, scarce hardware, or new ideas.
Multi-agent collectives may be the most underappreciated path, because a society of competent digital workers could outperform a brilliant individual model through specialization, speed, and coordination.
The big point is that ASI may not arrive as 1 sudden event, but as a chain of faster changes as AI helps create better AI and stronger scientific tools.
----
Link – arxiv. org/abs/2606.12683
Title: "From AGI to ASI"
Today's release:
There's an update to the Robotics Memory reading list, including new models, new benchmarks, and new ideas. Check it out!
https://t.co/8g0lE9qbfl
Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
https://t.co/c9AvsRKybj
What if we didn’t have to hold an entire neural network in memory to train it?
Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.
In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.
With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.
How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers
In each case, performance is competitive with end-to-end training while using a fraction of the memory.
This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.
Read our paper and code to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E
🐟
How can VLAs achieve 95+% reliability?
Using RL post-training with EXPO-FT:
- π0.5 improves to 30/30 success on all 8 tasks tested
- uses only 19 min of RL data on average
Paper & videos: https://t.co/54nO9tFU0Z
A new paper from Meta, Stanford, Google, and many other top labs proposes AutoResearchClaw.
Shows that automated research improves when AI can fail, recover, and ask humans at the right moments.
The paper is less about an “AI scientist” than about turning research into a governed loop.
Most systems still treat science like a production line: generate an idea, run code, write a paper, then stop when the chain breaks.
AutoResearchClaw treats failure as evidence, using debate, repair, verification, memory, and selective human input as parts of the same machine.
That is the main point: autonomy gets better when it is constrained by process, not when it is simply given more freedom.
On ARC-Bench, the system beat AI Scientist v2 by 54.7%, with its sharpest gains in result analysis, where claims had to match measurements rather than merely sound plausible.
The human result is more interesting: CoPilot reached an 87.5% accept rate, while full autonomy reached 25% and step-by-step oversight reached 50%, suggesting that too little judgment and too much supervision can both degrade science.
The most revealing failure was a case where every cross-validation method returned identical zero-bias outputs, which passed numeric verification but failed scientific meaning.
That is the boundary this paper exposes: machines can verify that numbers are real, but humans still notice when the experiment has stopped asking the right question.
----
Paper Link – arxiv. org/abs/2605.20025
Paper Title: "AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration"
Language Models Need Sleep
"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."
"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."
One of the hottest terms in AI right now is "On-policy distillation".
It is a post-training technique in which a student model, typically an LLM, samples from its current policy and receives a teacher signal for on-policy states. It combines the dense supervision of distillation with the locality of online RL.
Now a method on PapersWithCode!
Find all 183 papers that cite it, and more here: https://t.co/NIsUjyU3UP
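For readers who want to see the loop described above, here's a minimal toy sketch in plain Python. Everything in it is an assumption for illustration: `toy_student`, `toy_teacher`, and `on_policy_distill_loss` are hypothetical names, the vocabulary has 4 tokens, and the per-token teacher signal is a reverse KL, KL(student || teacher), which is one common choice rather than the definition of the method. The key point it shows is that the student's own samples decide which states get scored.

```python
import math
import random

VOCAB = 4  # toy vocabulary size (assumption)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)


def toy_teacher(tokens):
    """Hypothetical teacher: prefers the token after the last one, cyclically."""
    target = (tokens[-1] + 1) % VOCAB
    return [2.0 if i == target else 0.0 for i in range(VOCAB)]


def toy_student(tokens):
    """Hypothetical untrained student: uniform over the vocabulary."""
    return [0.0] * VOCAB


def on_policy_distill_loss(student_fn, teacher_fn, prompt, steps, rng):
    """Sample a continuation from the student and score every visited state
    with the teacher. Returns (mean per-token loss, sampled tokens)."""
    tokens = list(prompt)
    per_token = []
    for _ in range(steps):
        s = softmax(student_fn(tokens))
        t = softmax(teacher_fn(tokens))
        per_token.append(reverse_kl(s, t))            # dense teacher signal
        next_token = rng.choices(range(VOCAB), weights=s)[0]
        tokens.append(next_token)                     # on-policy: student's sample
    return sum(per_token) / len(per_token), tokens


loss, sampled = on_policy_distill_loss(
    toy_student, toy_teacher, prompt=[0], steps=5, rng=random.Random(0))
print(round(loss, 3), sampled)
```

In a real setup the loss would be backpropagated into the student; here it is only computed, to keep the sketch dependency-free.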
What if a robot could map without training or 3D labels?
HKUST & MBZUAI researchers present FreeOcc – a training-free open-vocabulary occupancy predictor. It builds a 4-layer map using SLAM, Gaussians, and VLMs.
Outperforms self-supervised methods by 2x in IoU/mIoU and transfers zero-shot to new scenes.
FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
Paper: https://t.co/cBUqlxpn58
Project: https://t.co/2tXuBi7Y4P
Code: https://t.co/wqbPBjSGkH
Our report: https://t.co/9YSmTYgXMw
📬 #PapersAccepted by Jiqizhixin
Google just figured out why AI lies with confidence.
Large language models still make confident mistakes on simple factual questions.
A new paper from Google Research explains why this keeps happening.
Models cannot reliably tell what they know from what they are guessing.
The internal score separating right answers from wrong ones sits around 0.70 to 0.85.
Forcing strict accuracy backfires.
Cutting errors from 25% to 5% means staying silent on over half of correct answers.
The team proposes faithful uncertainty.
The model's words should match its actual internal confidence.
Instead of refusing to answer, it hedges honestly.
"I think" becomes a real signal, not filler.
This same awareness tells agents when to reach for search tools.
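The accuracy-versus-silence trade-off above can be seen in a tiny sketch. The data here is made up for illustration and is not from the paper; `selective_stats` is a hypothetical helper that answers only when confidence clears a threshold, then reports the error rate among answered questions and the share of correct answers that were withheld.

```python
def selective_stats(confidences, correct, threshold):
    """Answer only when confidence >= threshold.

    Returns (error rate among answered, fraction of correct answers withheld).
    """
    answered = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    total_correct = sum(correct)
    kept_correct = sum(answered)
    error_rate = 1 - kept_correct / len(answered) if answered else 0.0
    withheld = 1 - kept_correct / total_correct if total_correct else 0.0
    return error_rate, withheld


# Made-up toy data: confidence is only loosely related to correctness.
confidences = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
correct     = [1,   1,   1,   0,   1,   1,   0,   1]

print(selective_stats(confidences, correct, threshold=0.0))   # (0.25, 0.0)
print(selective_stats(confidences, correct, threshold=0.65))  # (0.0, 0.5)
```

In this toy case, pushing errors from 25% to zero means withholding half of the answers the model would have gotten right, which is the kind of cost that hedged answers are meant to avoid.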
The paper flags open problems worth tackling:
> Static training versus shifting knowledge
> Alignment erasing confidence signals
> Misleading calibration metrics dominating evaluation
Added a DeepSeek Sparse Attention (DSA) from-scratch implementation to my LLMs-from-scratch repo thanks to an awesome new reader contrib.
With motivation, overview, and GPT-style model reference implementation as standalone example code: https://t.co/o2PMhjF0TN
Why can't robots react instantly to fast-changing environments?
Researchers from HKU and ACE Robotics introduce FASTER.
Instead of running all sampling steps before any movement, it uses a Horizon-Aware Schedule to compress the immediate action into a single denoising step.
Result: 10x faster reaction latency, enabling real-time table tennis on consumer GPUs.
FASTER: Rethinking Real-Time Flow VLAs
Paper: https://t.co/S5zS3XRQ50
Project: https://t.co/CFfg8dM4gz
Code: https://t.co/K2Eb9sZcKl
Our report: https://t.co/sJXF5XF1l9
📬 #PapersAccepted by Jiqizhixin
Alibaba released Qwen 3.7 max.
It ran unsupervised for 35 hours, made 1,158 tool calls, and rewrote a GPU kernel until it was 10x faster.
The core idea is simple: agentic skills improve the same way language skills do, through exposure to diverse environments during training. More varied environments, better generalization.
Here's what that unlocks in practice:
- Works across any agent framework
- Handles coding end-to-end
- Runs productivity workflows via tool integrations
That 35-hour run wasn't a broad self-improvement sweep. It was one model grinding through compile-profile-rewrite loops on a single well-defined target until the job was done.
That's not a chatbot completing tasks. That's something closer to an engineer iterating through solutions.
The model is available via API now.