Shivam Singh

@_sssshivvvv_

Joined February 2021

89 Following

12 Followers

158 Posts

_sssshivvvv_ retweeted

Sebastian Raschka

@rasbt

5 days ago

Just caught up with the recent GLM-5.2 release. The best open-weight model today.

Architecture-wise, it's built on the GLM-5 and GLM-5.1 architecture that I covered previously, which means it's reusing the Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA) mechanisms from DeepSeek V3.2. (I wrote about it here: https://t.co/tuunazfQ8y)

What's new is that they added an IndexShare mechanism. (That's a cross-layer reuse trick for DSA where instead of recomputing the sparse-attention top-k indexer in every layer, GLM-5.2 runs the full indexer only once every four layers and lets the following layers reuse those selected token indices. This keeps the same DSA idea but makes 1M-token inference much cheaper.)
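A minimal sketch of the cross-layer reuse described above, assuming a toy setting: the indexer runs only on every fourth layer and the layers in between reuse its selected token indices. The names `run_layers`, `topk_indices`, and `toy_indexer` are hypothetical and this is not GLM-5.2's actual code.

```python
def topk_indices(scores, k):
    """Return the positions of the k largest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(num_layers, indexer, k, share_every=4):
    """Run the indexer on every `share_every`-th layer and reuse its
    selection on the layers in between (assumed sharing pattern)."""
    selected = None
    indexer_calls = 0
    per_layer = []
    for layer in range(num_layers):
        if layer % share_every == 0:
            # Full indexer pass: score the cached tokens and pick the top k.
            selected = topk_indices(indexer(layer), k)
            indexer_calls += 1
        # Every layer attends only to the currently selected tokens.
        per_layer.append(selected)
    return per_layer, indexer_calls


def toy_indexer(layer):
    # Hypothetical stand-in: scores for 8 cached tokens that vary by layer.
    return [((layer + 1) * (i + 3)) % 7 for i in range(8)]


per_layer, calls = run_layers(num_layers=8, indexer=toy_indexer, k=3)
```

With 8 layers and sharing every 4, the indexer runs twice instead of 8 times, which is where the savings at long context would come from in this toy picture.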

244

926

100K

_sssshivvvv_ retweeted

Jitendra MALIK

@JitendraMalikCV

5 days ago

We can convert human videos to robot hand-object interaction trajectories in 4D. Enjoy!

Paper: https://t.co/AS0ecvTB9I
Website: https://t.co/KP4dVqxzUb
Code: https://t.co/KfOQTJN8vE

Authors: @bhawna_paliwal_, @HarithejaE, @willjhliang, @pabbeel, @notmahi, @JitendraMalikCV

760

467

57K

_sssshivvvv_ retweeted

Jeremy Howard

@jeremyphoward

5 days ago

Wow. @Zai_org GLM 5.2 is a marvel! It is *at least* as good as Opus 4.8 and GPT 5.5. It's super fast, inexpensive, and not too verbose. It responds with nuance and judgement, & handles long context VERY well. I've never experienced an open weights model like this before.

229

493

860K

_sssshivvvv_ retweeted

Hugues Bruyère

@smallfly

5 days ago

@FastCompany just published a great piece on @theworldlabs, @drfeifei, Marble, and the idea that spatial intelligence / world models may be one of the next big shifts in AI.

I was happy to be quoted in the article, but I also wanted to share more context about my own experience with World Labs and Marble, and why this direction is especially interesting to me. https://t.co/mdWBmSuNBe

My starting point: volumetric capture

For the past few years I’ve been exploring and using volumetric capture and reconstruction (photogrammetry, NeRFs, 3D Gaussian Splats), mostly capturing locations around Montreal. Alleys, museums, urban interiors. I love every step of it: the capture itself, the pipeline, and what can be done with the output. Turning real spaces into real-time explorable systems. I do this personally, sharing explorations here, and professionally as chief technologist and co-founder of Dpt.

Physical reality + generative manipulation

In my work I’m especially drawn to mixing physical reality with generative and digital manipulation: using physical interfaces (light, clay, ink, ...) to drive generative AI pipelines, building mixed reality prototypes that reshape your surroundings, or starting from real captured spaces and transforming them using tools like Marble.

Like many people, I saw the World Labs announcement on Twitter in September 2024, and Marble when it surfaced in early December. But by then, I already had a sense something was coming.

The first conversation

As someone deep into volumetric capture and radiance fields, I obviously knew about @BenMildenhall and his pioneering work on NeRF. To my surprise, Ben reached out to me in late June 2024. He’d been following some of my experiments and wanted to chat about my process and workflows and how I was using this “stuff” creatively. At that point he didn’t share what he was building, but we had a genuinely great conversation about radiance fields, AI, and my work. He was curious about the creative perspective, not just the technical one.

When the World Labs announcement dropped a few months later, it all made sense. I understood what Ben had been working on, and why the creative angle mattered to them. Then in August 2025, he invited me to try the Marble beta, and I’ve been experimenting with it since.

Experimenting with Marble

The first thing I used Marble for was materializing scene and world concepts during ideation at the studio, and seeing if and how it could fit into our production pipeline. In parallel, I dove into a series of experiments focused on world manipulation: starting from real captured spaces and transforming them using Marble.

I’d already been exploring that idea using img2img diffusion with ControlNet on NeRF renders, real-time video streams, and even mixed reality using headset camera feeds. But Marble brings something different. It generates persistent, spatially cohesive 3D worlds that can be rendered in real time across a wide range of devices. That’s a real shift.

Experiment 01: Parallel Realities

The first experiment, Parallel Realities, starts from a volumetric capture of a real location, reconstructed as 3D Gaussian Splats. Using Marble, I generate an alternate version of that same space, something informed by the original architecture: abandoned, nature-reclaimed, alternate era. Then, using Spark (World Labs’ 3D Gaussian Splatting renderer for THREE.js), I make both realities coexist in the same spatial coordinate system. From there, I use a portal UX mechanic to let the user step between the real reconstruction and the Marble-generated version.

Experiment 02: Hidden Depth

The second experiment, Hidden Depth, does not transform a space as much as expand it. A captured location has a visual boundary (a mural, a doorway, a dark corridor) and Marble generates what exists beyond it. For example: a Montreal alley has a painted mural; step through it and you’re inside a world informed by what is actually depicted there.

World Labs showcased part of this work here: https://t.co/0RQTDWsgs2
And in their Spark 2.0 post: https://t.co/X34yzkLBOm
The project page is here: https://t.co/T6Qxuuq9RJ

Why this matters to me

Being able to start from a real 3D Gaussian Splat scene and manipulate it with Marble opens up a lot of ideas. The 3DGS pipeline is becoming an increasingly compelling foundation for exploration, experimentation, and storytelling.

What matters most to me right now is more control. The more I can steer the generated scene or world, the more useful the tool becomes. I want more features like the already existing multiple input images and Chisel, the blockout-based approach. I would like better local control, the ability to expand a generated world more and more while preserving coherence, and the ability to directly import 3D Gaussian Splat scenes to be used as a starting point. I want more ways to shape the result, not just a “prompt and hope” approach.

It is exciting to see this field moving from research and demos toward actual creative workflows.

114

53K

_sssshivvvv_ retweeted

Alexi Gladstone

@AlexiGlad

7 days ago

Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling

But representation learning has relied on strong assumptions like augmentations, masking, cropping, etc... until now!

🎬 Introducing Temporal Difference in Vision (TDV), a new paradigm for representation learning built on a single assumption: causality

TL;DR:
- We introduce TDV, the first approach to learn good representations without any augmentations, masking, cropping, or pixel-based reconstruction
- TDV matches SOTA recipes like DINO and iBOT on dense spatial tasks
- We show that as data scales, weaker assumptions work better

🧵Thread:

827

119

682

81K

_sssshivvvv_ retweeted

Gabriele Berton

@gabriberton

9 days ago

The perfect LLM has over 10.5 quadrillion params

This number is based on actual research, but how?

Almost a year ago a paper called "Pre-training under infinite compute" came out of Stanford's most famous LLM researchers [1/6] https://t.co/MhJXKQIJkt

968

155K

_sssshivvvv_ retweeted

Gabriele Berton

@gabriberton

6 days ago

LLMs tell me that for each 1 token of text there are 1B tokens of vision (video frames).

So if vision follows similar scaling laws (it doesn't) and 1 vision token is worth 1 text token (it isn't), the perfect vision encoder has 10 billion quadrillion parameters.
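The back-of-the-envelope arithmetic behind that figure can be checked in a few lines. This sketch assumes, as the post does tongue-in-cheek, that the parameter count scales linearly with the token count; the 10.5 quadrillion starting point is the number quoted in the earlier post in this feed.

```python
# Assumption: parameters scale linearly with available tokens.
text_params = 10_500_000_000_000_000   # 10.5 quadrillion (10.5e15)
vision_tokens_per_text_token = 10**9   # "1B tokens of vision" per text token

vision_params = text_params * vision_tokens_per_text_token

# Express the result in units of "billion quadrillion" (1e9 * 1e15 = 1e24).
in_billion_quadrillion = vision_params / (10**9 * 10**15)
print(in_billion_quadrillion)  # 10.5, i.e. roughly "10 billion quadrillion"
```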

_sssshivvvv_ retweeted

Rohan Paul

@rohanpaul_ai

11 days ago

Beautiful paper from Google DeepMind.

Explains the pathways from AGI to ASI, and why that jump could happen through several routes.

The authors frame the AGI-to-ASI transition around 4 technical pathways:

- continued scaling of compute, model size, data, and test-time inference;

- algorithmic paradigm shifts beyond today’s transformer-based foundation-model stack;

- recursive self-improvement, where AI accelerates AI R&D and improves future systems; and

- multi-agent collective intelligence, where large populations of specialized agents coordinate into a superhuman group agent.

Scaling may work for a while, but it could hit limits in data, compute, energy, or weaker returns from making systems larger.

Recursive improvement is the most uncertain path, because AI could speed up AI research, but that loop may also slow if hard research problems need real-world testing, scarce hardware, or new ideas.

Multi-agent collectives may be the most underappreciated path, because a society of competent digital workers could outperform a brilliant individual model through specialization, speed, and coordination.

The big point is that ASI may not arrive as 1 sudden event, but as a chain of faster changes as AI helps create better AI and stronger scientific tools.

----

Link – arxiv. org/abs/2606.12683

Title: "From AGI to ASI"

847

175

657

52K

_sssshivvvv_ retweeted

Jie Wang

@JieWang_ZJUI

27 days ago

Today's release: an update to the Robotics Memory reading list, including new models, new benchmarks, and new ideas. Check it out! https://t.co/8g0lE9qbfl

_sssshivvvv_ retweeted

Sakana AI

@SakanaAILabs

27 days ago

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation https://t.co/c9AvsRKybj

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.

In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code to learn more.

Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E 🐟
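A toy illustration of the block-wise idea, under loud assumptions: each block is given its own range of noise levels and is fitted, on its own, to denoise inputs in that range. The noise-range split, the scalar affine "block", and the names `noise_ranges` and `train_block` are hypothetical simplifications, not the paper's implementation.

```python
import random


def noise_ranges(sigma_max, num_blocks):
    """Split [0, sigma_max] into equal intervals, highest-noise block first
    (an assumed partition, chosen only for illustration)."""
    step = sigma_max / num_blocks
    return [(sigma_max - (b + 1) * step, sigma_max - b * step)
            for b in range(num_blocks)]


def train_block(targets, sigma_lo, sigma_hi, steps=2000, lr=0.01, seed=0):
    """Fit one block (a scalar affine map standing in for a real layer) to
    denoise inputs whose noise level lies in its own range.
    No other block's parameters are touched or held in memory."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        y = rng.choice(targets)                  # clean target
        sigma = rng.uniform(sigma_lo, sigma_hi)  # noise level owned by this block
        z = y + sigma * rng.gauss(0.0, 1.0)      # noisy representation
        err = (w * z + b) - y                    # denoising error
        w -= lr * err * z                        # SGD step on squared error
        b -= lr * err
    return w, b


ranges = noise_ranges(sigma_max=2.0, num_blocks=4)
# Each block is trained in isolation, one after another.
blocks = [train_block([-1.0, 1.0], lo, hi) for lo, hi in ranges]
```

In this toy, the block that owns the lowest noise range learns a slope close to 1 (it barely has to change its input), while the highest-noise block learns a much smaller slope.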

366

871K

_sssshivvvv_ retweeted

Chelsea Finn

@chelseabfinn

27 days ago

How can VLAs achieve 95+% reliability?

Using RL post-training with EXPO-FT:
- π0.5 improves to 30/30 success on all 8 tasks tested
- uses only 19 min of RL data on average

Paper & videos: https://t.co/54nO9tFU0Z

315

258

40K

_sssshivvvv_ retweeted

Rohan Paul

@rohanpaul_ai

28 days ago

A new paper from Meta, Stanford, Google, and many other top labs proposes AutoResearchClaw.

Shows that automated research improves when AI can fail, recover, and ask humans at the right moments.

The paper is less about an “AI scientist” than about turning research into a governed loop.

Most systems still treat science like a production line: generate an idea, run code, write a paper, then stop when the chain breaks.

AutoResearchClaw treats failure as evidence, using debate, repair, verification, memory, and selective human input as parts of the same machine.

That is the main point: autonomy gets better when it is constrained by process, not when it is simply given more freedom.

On ARC-Bench, the system beat AI Scientist v2 by 54.7%, with its sharpest gains in result analysis, where claims had to match measurements rather than merely sound plausible.

The human result is more interesting: CoPilot reached an 87.5% accept rate, while full autonomy reached 25% and step-by-step oversight reached 50%, suggesting that too little judgment and too much supervision can both degrade science.

The most revealing failure was a case where every cross-validation method returned identical zero-bias outputs, which passed numeric verification but failed scientific meaning.

That is the boundary this paper exposes: machines can verify that numbers are real, but humans still notice when the experiment has stopped asking the right question.

----

Paper Link – arxiv. org/abs/2605.20025

Paper Title: "AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration"

_sssshivvvv_ retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

28 days ago

Language Models Need Sleep

"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."

"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."
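A rough sketch of the consolidation loop the abstract describes, assuming a simple Hebbian outer-product update as a stand-in for however the paper actually writes context into fast weights. The class `SleepyMemory` and the `wake_tokens` parameter are hypothetical.

```python
def outer_add(W, v, k):
    """In place: W += outer(v, k)."""
    for i, vi in enumerate(v):
        for j, kj in enumerate(k):
            W[i][j] += vi * kj


def matvec(W, k):
    """Return W @ k for a list-of-lists matrix."""
    return [sum(wij * kj for wij, kj in zip(row, k)) for row in W]


class SleepyMemory:
    def __init__(self, dim, wake_tokens):
        self.fast_weights = [[0.0] * dim for _ in range(dim)]
        self.kv_cache = []               # recent (key, value) pairs
        self.wake_tokens = wake_tokens   # tokens to buffer before sleeping

    def observe(self, key, value):
        self.kv_cache.append((key, value))
        if len(self.kv_cache) >= self.wake_tokens:
            self.sleep()

    def sleep(self):
        # Consolidate recent context into fast weights, then clear the cache.
        for key, value in self.kv_cache:
            outer_add(self.fast_weights, value, key)
        self.kv_cache.clear()

    def recall(self, key):
        # Read from the consolidated memory instead of the cleared cache.
        return matvec(self.fast_weights, key)


mem = SleepyMemory(dim=2, wake_tokens=2)
mem.observe([1.0, 0.0], [0.5, 0.25])
mem.observe([0.0, 1.0], [2.0, 3.0])   # second token triggers a sleep
```

After the sleep the cache is empty, yet `mem.recall([1.0, 0.0])` still returns the stored value because it now lives in the fast weights. With orthogonal keys, as here, recall is exact; with overlapping keys this simple update would blur the stored values.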

911

147

715

66K

_sssshivvvv_ retweeted

Niels Rogge

@NielsRogge

29 days ago

One of the hottest terms in AI right now is "On-policy distillation".

It is a post-training technique in which a student model, typically an LLM, samples from its current policy and receives a teacher signal for on-policy states. It combines the dense supervision of distillation with the locality of online RL.

Now a method on PapersWithCode!

Find all 183 papers that cite it, and more here: https://t.co/NIsUjyU3UP
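The definition above can be sketched in a few lines: the student generates its own trajectory, and at every state it visits, the teacher's distribution provides a dense per-token signal. Using a reverse KL divergence as that signal is one common choice and is an assumption here, as are the toy `student` and `teacher` callables.

```python
import math
import random


def sample(probs, rng):
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one state."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)


def on_policy_distill_loss(student, teacher, start_state, steps, rng):
    """Average per-token loss along a trajectory sampled from the student."""
    state, total = start_state, 0.0
    for _ in range(steps):
        s_probs = student(state)
        t_probs = teacher(state)     # teacher scores the student's own state
        total += reverse_kl(s_probs, t_probs)
        # On-policy: the next state comes from the student's sample.
        state = state + (sample(s_probs, rng),)
    return total / steps


# Hypothetical two-token policies that ignore the state.
student = lambda state: [0.5, 0.5]
teacher = lambda state: [0.9, 0.1]
loss = on_policy_distill_loss(student, teacher, (), steps=5, rng=random.Random(0))
```

In a real setup the loss would be backpropagated into the student; here it is only computed, to show where the teacher signal enters.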

128

85K

_sssshivvvv_ retweeted

机器之心 JIQIZHIXIN

@jiqizhixin

30 days ago

What if a robot could map without training or 3D labels?

HKUST & MBZUAI researchers present FreeOcc – a training-free open-vocabulary occupancy predictor. It builds a 4-layer map using SLAM, Gaussians, and VLMs.

Outperforms self-supervised by 2x in IoU/mIoU, zero-shot to new scenes.

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

Paper: https://t.co/cBUqlxpn58
Project: https://t.co/2tXuBi7Y4P
Code: https://t.co/wqbPBjSGkH

Our report: https://t.co/9YSmTYgXMw

📬 #PapersAccepted by Jiqizhixin

_sssshivvvv_ retweeted

AlphaSignal AI

@AlphaSignalAI

about 1 month ago

Google just figured out why AI lies with confidence.

Large language models still make confident mistakes on simple factual questions.

A new paper from Google Research explains why this keeps happening.

Models cannot reliably tell what they know from what they are guessing.

The internal score separating right answers from wrong ones sits around 0.70 to 0.85.

Forcing strict accuracy backfires.

Cutting errors from 25% to 5% means staying silent on over half of correct answers.

The team proposes faithful uncertainty.

The model's words should match its actual internal confidence.

Instead of refusing to answer, it hedges honestly.

"I think" becomes a real signal, not filler.

This same awareness tells agents when to reach for search tools.

The paper flags open problems worth tackling:

> Static training versus shifting knowledge
> Alignment erasing confidence signals
> Misleading calibration metrics dominating evaluation
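One way to picture "words that match internal confidence" is a mapping from a confidence score to a hedged phrasing, plus a rule for when to reach for a search tool. The thresholds and the function names `verbalize` and `should_search` below are hypothetical, not taken from the paper.

```python
def verbalize(answer, confidence):
    """Phrase an answer so the wording reflects the confidence score.
    The cut-offs are illustrative assumptions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= 0.9:
        return f"{answer}."
    if confidence >= 0.6:
        return f"I think {answer}, but I am not certain."
    if confidence >= 0.3:
        return f"My best guess is {answer}; treat it as unverified."
    return f"I am not sure. It might be {answer}, so this is worth looking up."


def should_search(confidence, threshold=0.6):
    """An agent could call a search tool when confidence is below a threshold."""
    return confidence < threshold


print(verbalize("Paris", 0.95))     # Paris.
print(verbalize("Canberra", 0.7))   # I think Canberra, but I am not certain.
```

The point of the design is that the model still answers at low confidence instead of staying silent, and the hedge carries the information a refusal would have thrown away.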

297

225

21K

_sssshivvvv_ retweeted

Sebastian Raschka

@rasbt

about 1 month ago

Added a DeepSeek Sparse Attention (DSA) from-scratch implementation to my LLMs-from-scratch repo thanks to an awesome new reader contrib.
With motivation, overview, and GPT-style model reference implementation as standalone example code: https://t.co/o2PMhjF0TN https://t.co/jjKyt3aPcR
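For readers who want the gist before opening the repo, here is a heavily simplified sketch of the sparse-attention idea: a cheap indexer scores the cached tokens, only the top-k are kept, and softmax attention runs over that subset. The plain dot-product indexer and the function name `sparse_attention` are assumptions for illustration; this is not the repo's code.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, index_query, index_keys, k):
    """Attend only to the k tokens the indexer scores highest."""
    # 1) Cheap indexer: one score per cached token.
    scores = [dot(index_query, ik) for ik in index_keys]
    # 2) Keep the top-k token positions.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    # 3) Ordinary scaled softmax attention, restricted to the kept tokens.
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in keep]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w / total * values[i][d] for w, i in zip(weights, keep))
           for d in range(len(values[0]))]
    return out, keep


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
index_keys = [[1.0], [3.0], [2.0]]
out, keep = sparse_attention([1.0, 0.0], keys, values, [1.0], index_keys, k=1)
```

With `k=1` the output is exactly the value of the single selected token; the attention cost depends on `k`, not on the full context length.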

242

75K

_sssshivvvv_ retweeted

机器之心 JIQIZHIXIN

@jiqizhixin

about 1 month ago

Why can't robots react instantly to fast-changing environments?

Researchers from HKU and ACE Robotics introduce FASTER.

Instead of running all sampling steps before any movement, it uses a Horizon-Aware Schedule to compress the immediate action into a single denoising step.

Result: 10x faster reaction latency, enabling real-time table tennis on consumer GPUs.

FASTER: Rethinking Real-Time Flow VLAs

Paper: https://t.co/S5zS3XRQ50
Project: https://t.co/CFfg8dM4gz
Code: https://t.co/K2Eb9sZcKl

Our report: https://t.co/sJXF5XF1l9

📬 #PapersAccepted by Jiqizhixin
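To make the scheduling idea concrete, here is a guess at its general shape: the immediate action gets a single denoising step, and actions further out in the chunk get progressively more. The linear ramp and the function name `horizon_aware_steps` are assumptions; the paper's actual schedule may look different.

```python
def horizon_aware_steps(horizon, max_steps):
    """Denoising steps per action index in a chunk: 1 for the immediate
    action, ramping linearly to `max_steps` for the farthest one
    (assumed shape, for illustration only)."""
    if horizon < 1 or max_steps < 1:
        raise ValueError("horizon and max_steps must be >= 1")
    if horizon == 1:
        return [1]
    return [1 + round((max_steps - 1) * i / (horizon - 1))
            for i in range(horizon)]


schedule = horizon_aware_steps(horizon=5, max_steps=9)
print(schedule)  # [1, 3, 5, 7, 9]

# A uniform schedule would need all 9 steps before the first action is ready;
# here the first action is ready after schedule[0] == 1 step.
```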

193

154

20K

_sssshivvvv_ retweeted

CuiMao

@CuiMao

about 1 month ago

It's finally here: Qwen base + Opus thinking = Qwopus. Arguably the pinnacle of cyber-hybrid models, and worth a try. https://t.co/DHipW7rfgB

126

653

624

88K

_sssshivvvv_ retweeted

AlphaSignal AI

@AlphaSignalAI

about 1 month ago

Alibaba released Qwen 3.7 max.

It ran unsupervised for 35 hours, made 1,158 tool calls, and rewrote a GPU kernel until it was 10x faster.

The core idea is simple: agentic skills improve the same way language skills do, through exposure to diverse environments during training. More varied environments, better generalization.

Here's what that unlocks in practice:
- Works across any agent framework
- Handles coding end-to-end
- Runs productivity workflows via tool integrations

That 35-hour run wasn't a broad self-improvement sweep. It was one model grinding through compile-profile-rewrite loops on a single well-defined target until the job was done.

That's not a chatbot completing tasks. That's something closer to an engineer iterating through solutions.

The model is available via API now.
